The Common Pile and Comma v0.1

by Trevor Hill-Hand
6th Jun 2025

https://www.engadget.com/ai/it-turns-out-you-can-train-ai-models-without-copyrighted-material-174016619.html

...the LLM's dataset uses only public domain and openly licensed material.

The paper (via The Washington Post) was a collaboration between 14 different institutions. The authors represent universities like MIT, Carnegie Mellon and the University of Toronto. Nonprofits like Vector Institute and the Allen Institute for AI also contributed.

The group built an 8 TB ethically-sourced dataset. Among the data was a set of 130,000 books in the Library of Congress. After inputting the material, they trained a seven-billion-parameter large language model (LLM) on that data. The result? It performed about as well as Meta's similarly sized Llama 2-7B from 2023. The team didn't publish benchmarks comparing its results to today's top models.

  • GitHub Repo: https://github.com/r-three/common-pile/tree/main
  • HuggingFace: https://huggingface.co/common-pile
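
If you want to try the model yourself, the HuggingFace org linked above should work with the standard `transformers` loading pattern. A minimal sketch, assuming the repo id is `common-pile/comma-v0.1-2t` (check the org page for the actual model names):

```python
# Minimal sketch of loading and sampling from Comma v0.1 via HuggingFace transformers.
# The repo id below is an assumption based on the org name; verify it at
# https://huggingface.co/common-pile before running.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "common-pile/comma-v0.1-2t"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The Common Pile is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```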

This was wonderful news for me today, if this works and is what they say it is. I have several hobby projects that I've basically kept locked in a vault because I've felt there was no ethical LLM available to encourage people to use for them, even just for 'fun' projects.