I was able to convert a GGUF version of Comma-v0.1 and get it working locally in Ollama.
This is, as far as I know, the first and still only LLM trained only on public domain and opt-in data. I'm glad to have this for my own experiments and toys: https://huggingface.co/jadael/comma-v0.1-2t-GGUF
It's a base model, whereas Ollama typically expects chat/assistant or instruct fine-tuned models. However, it works regardless, and base models are awfully fun to play with.
The conversion ended up being a little tricky; I had tried it myself earlier this year (July 2025) when Comma v0.1 first came out, but struggled to unpack what was going on in llama.cpp.
Today Claude Code with Sonnet 4.5 managed to figure it out. Amusingly, I noticed a lot of swearing from Claude as it worked, but it powered through and managed to get a working result. Since it took so much fiddling and swearing, I asked for Claude's account of the process:
Conversion Story
Converting Comma v0.1-2T to GGUF: A Technical Journey
Introduction
When I set out to convert Comma v0.1-2T to GGUF format for use with Ollama, I expected it to be straightforward. After all, it's a Llama 3 architecture model, and llama.cpp has excellent support for Llama models. What followed was a fascinating deep-dive into tokenizer compatibility, format conversions, and the intricate details of how modern language models are packaged.
This is the story of that conversion - the challenges encountered, solutions discovered, and lessons learned along the way.
Why Comma v0.1?
Comma v0.1 is special. It's a 7 billion parameter language model trained exclusively on openly licensed and public domain text from the Common Pile dataset. In an era where training data provenance is increasingly scrutinized, Comma represents a principled approach: competitive performance achieved using only ethically sourced training data.
The model comes in two variants:
- Comma v0.1-1T: Trained on 1 trillion tokens
- Comma v0.1-2T: Trained on 2 trillion tokens (the one we converted)
Performance benchmarks show Comma v0.1-2T is competitive with Llama 2 7B, OLMo, and DeepSeekLLM on knowledge-intensive and coding tasks. But despite this impressive pedigree, no GGUF conversion existed - making it inaccessible to the llama.cpp and Ollama communities.
The Initial Attempt
The conversion process seemed simple enough: point llama.cpp's convert_hf_to_gguf.py at the downloaded HuggingFace checkpoint and let it run.
The script started well - it recognized the Llama architecture, loaded the model weights, processed all 32 transformer blocks... and then crashed.
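For the record, the invocation was nothing exotic - roughly the following, assuming the checkpoint was cloned to ./comma-v0.1-2t (the paths and output name here are illustrative, not copied from my shell history):
python llama.cpp/convert_hf_to_gguf.py ./comma-v0.1-2t --outfile comma-v0.1-2t.gguf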
Challenge #1: The Missing Tokenizer
The first error was straightforward:
FileNotFoundError: File not found: tokenizer.model
Llama.cpp's converter was looking for tokenizer.model - a SentencePiece format tokenizer file common in older Llama models. But Comma v0.1 uses the newer HuggingFace tokenizer.json format instead.
This wasn't a showstopper - llama.cpp supports both formats. The converter should have fallen back to reading tokenizer.json. So why was it failing?
Challenge #2: The Unrecognized Checksum
The real error came next:
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()
chkhsh: bf66900d65fe80247e435184a4ac839c5c332657cf567e64b8ede5fbd63f5fd9
This was more interesting. The llama.cpp converter uses cryptographic checksums to identify known tokenizer formats. When it encounters a new tokenizer, it computes a checksum of the tokenizer configuration and looks it up in a hardcoded table of known tokenizers.
Comma v0.1's tokenizer - while functionally identical to Llama 3's BPE tokenizer - had a unique checksum that wasn't in llama.cpp's database. The converter literally didn't know this tokenizer existed.
The Detective Work
I examined the error more carefully. The converter helpfully prints the problematic checksum and suggests updating the get_vocab_base_pre() function. This function lives in convert_hf_to_gguf.py around line 900.
Looking at the code, I found a long chain of checksum comparisons:
if chkhsh == "some_checksum_1":
    res = "llama-bpe"
if chkhsh == "some_checksum_2":
    res = "gpt2"
if chkhsh == "some_checksum_3":
    res = "llama4"
# ... many more ...
if res is None:
    logger.warning("BPE pre-tokenizer was not recognized!")
raise NotImplementedError(...)
The solution was clear: add Comma's checksum to this list and map it to the appropriate tokenizer type.
Finding the Right Mapping
But which tokenizer type should Comma map to? The model uses Llama 3's architecture, and examining the tokenizer.json confirmed it uses Byte-Pair Encoding (BPE) with a structure very similar to Llama 3.
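A quick way to sanity-check this yourself is to peek at the tokenizer definition directly. A minimal sketch, assuming the checkpoint was downloaded to ./comma-v0.1-2t:
import json

# Inspect the HuggingFace tokenizer definition to confirm the BPE setup
with open("./comma-v0.1-2t/tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

print(tok["model"]["type"])                        # expect "BPE"
print(len(tok["model"]["vocab"]))                  # vocabulary size
print((tok.get("pre_tokenizer") or {}).get("type"))  # pre-tokenizer family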
I found several existing Llama 3 models in the converter's checksum list, all mapping to "llama-bpe". This was the answer.
The First Patch Attempt
I created a patch script (patch_converter.py) to automate the fix:
patch_code = f'''
# Patch for Comma v0.1 tokenizer (Llama 3 compatible)
if chkhsh == "{failing_checksum}":
    # Comma v0.1 uses Llama 3 style BPE
    res = "llama-bpe"
'''
But where to insert it? My first attempt placed the patch after the warning block - which meant it would execute, but then the unconditional raise NotImplementedError would fire anyway.
Challenge #3: Control Flow
The bug was subtle. Looking at the original code:
if res is None:
    logger.warning("...")
    logger.warning("...")
    # ... more warnings ...

# My patch was here
if chkhsh == "bf66900d...":
    res = "llama-bpe"

raise NotImplementedError(...)  # This always ran!
The raise statement was unconditional - it would execute whether or not res was set! This was clearly a bug in my understanding of the code structure.
The fix: the raise should be inside the if res is None: block. That way it only raises an error if no tokenizer was matched.
The Corrected Solution
The final patch placed the checksum check with the other checksum checks (before the if res is None:), and moved the raise inside the error block:
# Add Comma's checksum with the others
if chkhsh == "bf66900d65fe80247e435184a4ac839c5c332657cf567e64b8ede5fbd63f5fd9":
    # Patch for Comma v0.1 tokenizer (Llama 3 compatible)
    # ref: https://huggingface.co/common-pile/comma-v0.1-2t
    res = "llama-bpe"

# Check if any tokenizer matched
if res is None:
    logger.warning("BPE pre-tokenizer was not recognized!")
    raise NotImplementedError(...)

# Continue with conversion
logger.debug(f"tokenizer.ggml.pre: {repr(res)}")
return res
Challenge #4: Quantization Types
With the tokenizer issue fixed, I tried the conversion again - and hit a new error. It turns out convert_hf_to_gguf.py only does basic conversions; advanced quantization types like Q4_K_M require a separate step using llama-quantize. Since we wanted the full-precision version anyway (to preserve quality and allow users to quantize to their preferred format later), we used F16.
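The F16 pass was just a re-run of the converter with an explicit output type - something along these lines (again, paths and file names are illustrative):
python llama.cpp/convert_hf_to_gguf.py ./comma-v0.1-2t --outfile comma-v0.1-2t-f16.gguf --outtype f16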
Success! The conversion took about 15 minutes, processing all 291 tensors and converting weights from bfloat16 to F16 format. The progress output was beautiful.
The final file: 14GB of pure F16 precision goodness.
Testing with Ollama
Creating a Modelfile for a base model required some thought. Base models don't follow instructions - they complete text. The Modelfile needed to reflect this:
FROM ./comma-v0.1-2t-f16.gguf
TEMPLATE """{{ .Prompt }}"""
PARAMETER stop "<|end_of_text|>"
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
# Note: This is a base model (not instruction-tuned).
# It will continue text rather than follow instructions.
The key insight: no SYSTEM prompt. System prompts are for instruction-tuned models that understand roles and directives. Base models just see more text.
Import and test:
ollama create comma-v0.1-2t -f Modelfile
ollama run comma-v0.1-2t
The first generation was delightful - given the beginning of a conversation, it naturally continued it:
Hello, how can I aid you?
Can you please help me in booking my flight ticket from OAK to DCA?
Sure, may I know your trip dates?
My planned travel dates are Dec 17th and Dec 19th.
Do you have any specifications?
I am looking for a connecting flight which returns at evening time.
[continues naturally with flight booking dialogue...]
Perfect! The model was working exactly as expected for a well-trained base model.
Lessons Learned
1. Tokenizer Checksums are Fragile
The checksum-based tokenizer identification is clever but brittle. Any change to the tokenizer configuration - even a cosmetic one - produces a new checksum that won't be recognized. This is good for correctness (it prevents silently applying the wrong pre-tokenizer) but bad for new models.
Solution for the community: Submit tokenizer checksums upstream to llama.cpp so future users don't hit this issue.
2. Model Format Evolution
The shift from SentencePiece (tokenizer.model) to HuggingFace JSON format (tokenizer.json) is ongoing. Tools need to handle both gracefully. The llama.cpp converter does this well, but the error messages could be clearer about which format is actually missing.
3. Base Models Need Different UX
The distinction between base models and instruction-tuned models is often lost in documentation. Base models:
- Complete text naturally
- Don't understand "system" vs "user" roles
- Work best with natural continuations
- Are ideal for fine-tuning
Instruction-tuned models:
- Follow directives and prompts
- Understand role-based dialogue
- Can be "told" what to do
- Are ready for end-user chat
Our Modelfile and documentation needed to make this crystal clear.
4. F16 is Underrated
Everyone jumps to quantized models (Q4, Q5, Q8) to save space. But F16 preserves the full precision of the original model while being significantly smaller than F32. For systems with sufficient VRAM, it's the sweet spot.
Plus, users can always quantize down from F16 later if needed. You can't un-quantize back up.
The Patch Script
To help others facing similar issues, we created patch_converter.py - an automated script that:
- Detects if llama.cpp is present
- Locates the problematic code in convert_hf_to_gguf.py
- Inserts the fix in the right place
- Validates the patch worked
This makes the conversion reproducible and helps document exactly what changed.
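The core of it is simple string surgery on the converter. A simplified sketch of the idea follows - this is not the literal patch_converter.py; the checkout location and the anchor line it searches for are assumptions based on the converter layout described above:
from pathlib import Path

CONVERTER = Path("llama.cpp/convert_hf_to_gguf.py")  # assumed checkout location
CHKHSH = "bf66900d65fe80247e435184a4ac839c5c332657cf567e64b8ede5fbd63f5fd9"

PATCH = (
    f'        if chkhsh == "{CHKHSH}":\n'
    '            # Comma v0.1 uses Llama 3 style BPE\n'
    '            res = "llama-bpe"\n'
)

source = CONVERTER.read_text()
if CHKHSH in source:
    print("Converter already patched - nothing to do.")
else:
    # Insert the new checksum check just before the unrecognized-tokenizer
    # guard, so it sits alongside the other checksum comparisons.
    anchor = "        if res is None:"
    if anchor not in source:
        raise SystemExit("Could not find the insertion point - has the converter layout changed?")
    CONVERTER.write_text(source.replace(anchor, PATCH + "\n" + anchor, 1))
    print(f"Patched {CONVERTER}")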
Future Work
Additional Quantizations
The F16 version is ideal for preservation and quality, but many users want smaller variants:
- Q8_0: Near-lossless quality, ~7.5GB
- Q5_K_M: Good quality, ~4.5GB
- Q4_K_M: Balanced quality/size, ~4GB
- Q4_0: Maximum compression, ~3.5GB
Each requires a separate pass with the llama-quantize tool.
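A typical invocation would look something like this - treat it as a sketch, since the binary name and paths depend on how llama.cpp was built:
./llama-quantize comma-v0.1-2t-f16.gguf comma-v0.1-2t-Q4_K_M.gguf Q4_K_M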
Upstream Contribution
The proper solution is to submit a pull request to llama.cpp adding Comma v0.1's tokenizer checksum to the official list. This would make the patch unnecessary for future users.
Performance Benchmarks
It would be valuable to benchmark the GGUF version against the original PyTorch model to verify conversion accuracy. Metrics like perplexity on a standard corpus would confirm we didn't lose anything in translation.
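llama.cpp's perplexity tool would do for a first pass - roughly like this, with the evaluation file being whichever held-out corpus is chosen (paths illustrative):
./llama-perplexity -m comma-v0.1-2t-f16.gguf -f wiki.test.raw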
Technical Appendix: The Checksum
For those curious, the checksum isn't a hash of the model weights or of the JSON file itself - the converter derives it from the tokenizer's behavior, by encoding a fixed reference string and hashing the resulting token IDs.
Comma v0.1's checksum: bf66900d65fe80247e435184a4ac839c5c332657cf567e64b8ede5fbd63f5fd9
This uniquely identifies Comma's tokenizer: a Llama 3-style BPE with a 64,256-token vocabulary.
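Roughly, the computation looks like the sketch below. This is not the converter's literal code - llama.cpp uses its own, much longer reference text, so this snippet will not reproduce the exact value, but it shows the mechanism:
from hashlib import sha256
from transformers import AutoTokenizer

# Illustrative reference text; the real one in convert_hf_to_gguf.py is a long
# string designed to exercise whitespace, digits, and emoji handling.
chktxt = "Hello world!\n\t 3.14 🚀"

tokenizer = AutoTokenizer.from_pretrained("common-pile/comma-v0.1-2t")
chktok = tokenizer.encode(chktxt)
chkhsh = sha256(str(chktok).encode()).hexdigest()
print(chkhsh)  # compared against the hardcoded table in get_vocab_base_pre()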
Conclusion
What started as a simple format conversion became a journey into the guts of language model tooling. The challenges we encountered - missing tokenizers, unrecognized checksums, control flow bugs - are all solvable, but they highlight how much implicit knowledge is embedded in these conversion tools.
By documenting this process and sharing the solutions, we hope to:
- Make Comma v0.1 accessible to the Ollama/llama.cpp community
- Help others facing similar tokenizer compatibility issues
- Contribute to the broader understanding of model format conversions
The result: A working, tested, community-ready GGUF conversion of an excellent ethically-trained language model.
And perhaps most importantly: One more data point showing that competitive language models can indeed be built on purely open and ethically sourced data.
Files Available:
- comma-v0.1-2t-f16.gguf - The converted model (14GB)
- patch_converter.py - Automated patching script
- Modelfile - Ollama configuration
- Full llama.cpp with patch applied
Resources:
- Original model: https://huggingface.co/common-pile/comma-v0.1-2t
- Common Pile dataset: https://huggingface.co/common-pile
- llama.cpp: https://github.com/ggerganov/llama.cpp
Conversion date: October 18, 2025