I was able to convert a GGUF version of Comma-v0.1 and get it working locally in Ollama.
This is, as far as I know, the first and still only LLM trained only on public domain and opt-in data. I'm glad to have this for my own experiments and toys: https://huggingface.co/jadael/comma-v0.1-2t-GGUF
It's a base model, whereas Ollama typically expects chat/assistant or instruct fine-tuned models. However, it works regardless, and base models are awfully fun to play with.
The conversion ended up being a little tricky; I had tried it myself earlier this year (July 2025) when Comma v0.1 first came out, but struggled to unpack what was going on in llama.cpp.
Today Claude Code with Sonnet 4.5 managed to figure it out. Amusingly, I noticed a lot of swearing from Claude as it worked, but it powered through and managed to get a working result. Since it took so much fiddling and swearing, I asked for Claude's account of the process:
Conversion Story
Converting Comma v0.1-2T to GGUF: A Technical Journey
Introduction
When I set out to convert Comma v0.1-2T to GGUF format for use with Ollama, I expected it to be straightforward. After all, it's a Llama 3 architecture model, and llama.cpp has excellent support for Llama models. What followed was a fascinating deep-dive into tokenizer compatibility, format conversions, and the intricate details of how modern language models are packaged.
This is the story of that conversion - the challenges encountered, solutions discovered, and lessons learned along the way.
Why Comma v0.1?
Comma v0.1 is special. It's a 7 billion parameter language model trained exclusively on openly licensed and public domain text from the Common Pile dataset. In an era where training data provenance is increasingly scrutinized, Comma represents a principled approach: competitive performance achieved using only ethically sourced training data.
The model comes in two variants:
- Comma v0.1-1T: Trained on 1 trillion tokens
- Comma v0.1-2T: Trained on 2 trillion tokens (the one we converted)
Performance benchmarks show Comma v0.1-2T is competitive with Llama 2 7B, OLMo, and DeepSeekLLM on knowledge-intensive and coding tasks. But despite this impressive pedigree, no GGUF conversion existed - making it inaccessible to the llama.cpp and Ollama communities.
The Initial Attempt
The conversion process seemed simple enough: point llama.cpp's convert_hf_to_gguf.py at the downloaded HuggingFace checkpoint and let it run.
The script started well - it recognized the Llama architecture, loaded the model weights, processed all 32 transformer blocks... and then crashed.
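For the record, the invocation was nothing exotic - roughly the following, assuming the checkpoint was cloned to ./comma-v0.1-2t (the paths and output name here are illustrative, not copied from my shell history):
python llama.cpp/convert_hf_to_gguf.py ./comma-v0.1-2t --outfile comma-v0.1-2t.gguf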
Challenge #1: The Missing Tokenizer
The first error was straightforward:
FileNotFoundError: File not found: tokenizer.model
Llama.cpp's converter was looking for tokenizer.model - a SentencePiece format tokenizer file common in older Llama models. But Comma v0.1 uses the newer HuggingFace tokenizer.json format instead.
This wasn't a showstopper - llama.cpp supports both formats. The converter should have fallen back to reading tokenizer.json. So why was it failing?
Challenge #2: The Unrecognized Checksum
The real error came next:
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()
chkhsh: bf66900d65fe80247e435184a4ac839c5c332657cf567e64b8ede5fbd63f5fd9
This was more interesting. The llama.cpp converter uses cryptographic checksums to identify known tokenizer formats. When it encounters a new tokenizer, it computes a checksum of the tokenizer configuration and looks it up in a hardcoded table of known tokenizers.
Comma v0.1's tokenizer - while functionally identical to Llama 3's BPE tokenizer - had a unique checksum that wasn't in llama.cpp's database. The converter literally didn't know this tokenizer existed.
The Detective Work
I examined the error more carefully. The converter helpfully prints the problematic checksum and suggests updating the get_vocab_base_pre() function. This function lives in convert_hf_to_gguf.py around line 900.
Looking at the code, I found a long chain of checksum comparisons:
if chkhsh == "some_checksum_1":
    res = "llama-bpe"
if chkhsh == "some_checksum_2":
    res = "gpt2"
if chkhsh == "some_checksum_3":
    res = "llama4"
# ... many more ...
if res is None:
    logger.warning("BPE pre-tokenizer was not recognized!")
raise NotImplementedError(...)
The solution was clear: add Comma's checksum to this list and map it to the appropriate tokenizer type.
Finding the Right Mapping
But which tokenizer type should Comma map to? The model uses Llama 3's architecture, and examining the tokenizer.json confirmed it uses Byte-Pair Encoding (BPE) with a structure very similar to Llama 3.
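A quick way to sanity-check this yourself is to peek at the tokenizer definition directly. A minimal sketch, assuming the checkpoint was downloaded to ./comma-v0.1-2t:
import json

# Inspect the HuggingFace tokenizer definition to confirm the BPE setup
with open("./comma-v0.1-2t/tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

print(tok["model"]["type"])                        # expect "BPE"
print(len(tok["model"]["vocab"]))                  # vocabulary size
print((tok.get("pre_tokenizer") or {}).get("type"))  # pre-tokenizer family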
I found several existing Llama 3 models in the converter's checksum list, all mapping to "llama-bpe". This was the answer.
The First Patch Attempt
I created a patch script (patch_converter.py) to automate the fix:
patch_code = f'''
# Patch for Comma v0.1 tokenizer (Llama 3 compatible)
if chkhsh == "{failing_checksum}":
    # Comma v0.1 uses Llama 3 style BPE
    res = "llama-bpe"
'''
But where to insert it? My first attempt placed the patch after the warning block - which meant it would execute, but then the unconditional raise NotImplementedError would fire anyway.
Challenge #3: Control Flow
The bug was subtle. Looking at the original code:
if res is None:
    logger.warning("...")
    logger.warning("...")
    # ... more warnings ...

# My patch was here
if chkhsh == "bf66900d...":
    res = "llama-bpe"

raise NotImplementedError(...)  # This always ran!
The raise statement was unconditional - it would execute whether or not res was set! This was clearly a bug in my understanding of the code structure.
The fix: the raise should be inside the if res is None: block. That way it only raises an error if no tokenizer was matched.
The Corrected Solution
The final patch placed the checksum check with the other checksum checks (before the if res is None:), and moved the raise inside the error block:
# Add Comma's checksum with the others
if chkhsh == "bf66900d65fe80247e435184a4ac839c5c332657cf567e64b8ede5fbd63f5fd9":
    # Patch for Comma v0.1 tokenizer (Llama 3 compatible)
    # ref: https://huggingface.co/common-pile/comma-v0.1-2t
    res = "llama-bpe"

# Check if any tokenizer matched
if res is None:
    logger.warning("BPE pre-tokenizer was not recognized!")
    raise NotImplementedError(...)

# Continue with conversion
logger.debug(f"tokenizer.ggml.pre: {repr(res)}")
return res
Challenge #4: Quantization Types
With the tokenizer issue fixed, I tried the conversion again - and hit a new error. It turns out convert_hf_to_gguf.py only does basic conversions; advanced quantization types like Q4_K_M require a separate step using llama-quantize. Since we wanted the full-precision version anyway (to preserve quality and allow users to quantize to their preferred format later), we used F16.
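The F16 pass was just a re-run of the converter with an explicit output type - something along these lines (again, paths and file names are illustrative):
python llama.cpp/convert_hf_to_gguf.py ./comma-v0.1-2t --outfile comma-v0.1-2t-f16.gguf --outtype f16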
Success! The conversion took about 15 minutes, processing all 291 tensors and converting weights from bfloat16 to F16 format. The progress output was beautiful.
The final file: 14GB of pure F16 precision goodness.
Testing with Ollama
Creating a Modelfile for a base model required some thought. Base models don't follow instructions - they complete text. The Modelfile needed to reflect this:
FROM ./comma-v0.1-2t-f16.gguf
TEMPLATE """{{ .Prompt }}"""
PARAMETER stop "<|end_of_text|>"
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
# Note: This is a base model (not instruction-tuned).
# It will continue text rather than follow instructions.
The key insight: no SYSTEM prompt. System prompts are for instruction-tuned models that understand roles and directives. Base models just see more text.
Import and test:
ollama create comma-v0.1-2t -f Modelfile
ollama run comma-v0.1-2t
The first generation was delightful - given the beginning of a conversation, it naturally continued it:
Hello, how can I aid you?
Can you please help me in booking my flight ticket from OAK to DCA?
Sure, may I know your trip dates?
My planned travel dates are Dec 17th and Dec 19th.
Do you have any specifications?
I am looking for a connecting flight which returns at evening time.
[continues naturally with flight booking dialogue...]
Perfect! The model was working exactly as expected for a well-trained base model.
Lessons Learned
1. Tokenizer Checksums are Fragile
The checksum-based tokenizer identification is clever but brittle. Any change to the tokenizer configuration - even a cosmetic one - produces a new checksum that won't be recognized. This is good for correctness (it prevents silently applying the wrong pre-tokenizer) but bad for new models.
Solution for the community: Submit tokenizer checksums upstream to llama.cpp so future users don't hit this issue.
2. Model Format Evolution
The shift from SentencePiece (tokenizer.model) to HuggingFace JSON format (tokenizer.json) is ongoing. Tools need to handle both gracefully. The llama.cpp converter does this well, but the error messages could be clearer about which format is actually missing.
3. Base Models Need Different UX
The distinction between base models and instruction-tuned models is often lost in documentation. Base models:
- Complete text naturally
- Don't understand "system" vs "user" roles
- Work best with natural continuations
- Are ideal for fine-tuning
Instruction-tuned models:
- Follow directives and prompts
- Understand role-based dialogue
- Can be "told" what to do
- Are ready for end-user chat
Our Modelfile and documentation needed to make this crystal clear.
4. F16 is Underrated
Everyone jumps to quantized models (Q4, Q5, Q8) to save space. But F16 preserves the full precision of the original model while being significantly smaller than F32. For systems with sufficient VRAM, it's the sweet spot.
Plus, users can always quantize down from F16 later if needed. You can't un-quantize back up.
The Patch Script
To help others facing similar issues, we created patch_converter.py - an automated script that:
- Detects if llama.cpp is present
- Locates the problematic code in convert_hf_to_gguf.py
- Inserts the fix in the right place
- Validates the patch worked
This makes the conversion reproducible and helps document exactly what changed.
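The core of it is simple string surgery on the converter. A simplified sketch of the idea follows - this is not the literal patch_converter.py; the checkout location and the anchor line it searches for are assumptions based on the converter layout described above:
from pathlib import Path

CONVERTER = Path("llama.cpp/convert_hf_to_gguf.py")  # assumed checkout location
CHKHSH = "bf66900d65fe80247e435184a4ac839c5c332657cf567e64b8ede5fbd63f5fd9"

PATCH = (
    f'        if chkhsh == "{CHKHSH}":\n'
    '            # Comma v0.1 uses Llama 3 style BPE\n'
    '            res = "llama-bpe"\n'
)

source = CONVERTER.read_text()
if CHKHSH in source:
    print("Converter already patched - nothing to do.")
else:
    # Insert the new checksum check just before the unrecognized-tokenizer
    # guard, so it sits alongside the other checksum comparisons.
    anchor = "        if res is None:"
    if anchor not in source:
        raise SystemExit("Could not find the insertion point - has the converter layout changed?")
    CONVERTER.write_text(source.replace(anchor, PATCH + "\n" + anchor, 1))
    print(f"Patched {CONVERTER}")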
Future Work
Additional Quantizations
The F16 version is ideal for preservation and quality, but many users want smaller variants:
- Q8_0: Near-lossless quality, ~7.5GB
- Q5_K_M: Good quality, ~4.5GB
- Q4_K_M: Balanced quality/size, ~4GB
- Q4_0: Maximum compression, ~3.5GB
Each requires a separate pass with the llama-quantize tool.
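A typical invocation would look something like this - treat it as a sketch, since the binary name and paths depend on how llama.cpp was built:
./llama-quantize comma-v0.1-2t-f16.gguf comma-v0.1-2t-Q4_K_M.gguf Q4_K_M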
Upstream Contribution
The proper solution is to submit a pull request to llama.cpp adding Comma v0.1's tokenizer checksum to the official list. This would make the patch unnecessary for future users.
Performance Benchmarks
It would be valuable to benchmark the GGUF version against the original PyTorch model to verify conversion accuracy. Metrics like perplexity on a standard corpus would confirm we didn't lose anything in translation.
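llama.cpp's perplexity tool would do for a first pass - roughly like this, with the evaluation file being whichever held-out corpus is chosen (paths illustrative):
./llama-perplexity -m comma-v0.1-2t-f16.gguf -f wiki.test.raw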
Technical Appendix: The Checksum
For those curious, the checksum isn't a hash of the model weights or of the JSON file itself - the converter derives it from the tokenizer's behavior, by encoding a fixed reference string and hashing the resulting token IDs.
Comma v0.1's checksum: bf66900d65fe80247e435184a4ac839c5c332657cf567e64b8ede5fbd63f5fd9
This uniquely identifies Comma's tokenizer: a Llama 3-style BPE with a 64,256-token vocabulary.
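Roughly, the computation looks like the sketch below. This is not the converter's literal code - llama.cpp uses its own, much longer reference text, so this snippet will not reproduce the exact value, but it shows the mechanism:
from hashlib import sha256
from transformers import AutoTokenizer

# Illustrative reference text; the real one in convert_hf_to_gguf.py is a long
# string designed to exercise whitespace, digits, and emoji handling.
chktxt = "Hello world!\n\t 3.14 🚀"

tokenizer = AutoTokenizer.from_pretrained("common-pile/comma-v0.1-2t")
chktok = tokenizer.encode(chktxt)
chkhsh = sha256(str(chktok).encode()).hexdigest()
print(chkhsh)  # compared against the hardcoded table in get_vocab_base_pre()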
Conclusion
What started as a simple format conversion became a journey into the guts of language model tooling. The challenges we encountered - missing tokenizers, unrecognized checksums, control flow bugs - are all solvable, but they highlight how much implicit knowledge is embedded in these conversion tools.
By documenting this process and sharing the solutions, we hope to:
- Make Comma v0.1 accessible to the Ollama/llama.cpp community
- Help others facing similar tokenizer compatibility issues
- Contribute to the broader understanding of model format conversions
The result: A working, tested, community-ready GGUF conversion of an excellent ethically-trained language model.
And perhaps most importantly: One more data point showing that competitive language models can indeed be built on purely open and ethically sourced data.
Files Available:
- comma-v0.1-2t-f16.gguf - The converted model (14GB)
- patch_converter.py - Automated patching script
- Modelfile - Ollama configuration
- Full llama.cpp with patch applied
Resources:
- Original model: https://huggingface.co/common-pile/comma-v0.1-2t
- Common Pile dataset: https://huggingface.co/common-pile
- llama.cpp: https://github.com/ggerganov/llama.cpp
Conversion date: October 18, 2025