I did it. I built an oracle AI.
Or at least, I did for one definition of “oracle”.
It’s called Kaku-Ora, and it’s an AI divination oracle inspired by the likes of the I Ching, but trained on Zen koans. You ask it a question, and it treats your question like the first half of a koan. It then gives a response and offers a capping verse to “explain” the response.
Why would I do such a thing? Several reasons:
I find divination oracles like the I Ching and Tarot interesting, not because I think they work by some supernatural means, but because injecting randomness into our thought processes can induce insight.
I wanted to get some hands-on experience training AI models.
Diffusion models are interesting!
In particular, diffusion models seem like a good fit for mimicking how koan work is done.
I have the extremely vague notion that it might be a good idea to teach AI dharma, as this might teach them compassion, and sufficiently compassionate AI might be aligned.
What follows is the story of how I built Kaku-Ora, and the many missteps I made along the way.
To even get started I needed a corpus of training data.
Luckily, most core Zen texts are online. Unfortunately, there’s just not that much material, because Zen emphasizes transmission outside scripture. Also, a lot of the text is crammed inside low-quality PDFs.
Thankfully, Claude turns out to be really good at creating training data (a fact that’s unsurprising once you think about it). I was able to hand it large chunks of text, give it a few examples of what I wanted, and in the end got nearly 20k examples in JSONL format—~5k “clean” examples from the koan literature, and another ~15k from non-koan Zen exchanges I found in texts.
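Each training example ended up as one JSON object per line. The field names and sample pairs below are illustrative rather than my exact schema, but they give the flavor:

```python
import json

# Illustrative schema only; the real dataset's field names differ in detail.
examples = [
    {"question": "A monk asked Zhaozhou: does a dog have Buddha-nature?",
     "response": "Mu."},
    {"question": "Why did Bodhidharma come from the West?",
     "response": "The cypress tree in the garden."},
]

with open("koans.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```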
I picked diffusion because it isn’t autoregressive. Most language models generate output one token at a time, in order. Diffusion models are different. They start with noise, then iteratively refine the whole output to remove noise until something coherent is revealed.
This kind of approach seemed like a better fit for koans. Koans don’t unfold through linear logic. They come from an expression of direct experience. When a response comes to you, it’s as if it appears, wholly formed, from the void. My theory is that diffusion better replicates that.
To be clear, I don’t think that diffusion can really do, on its own, what a human does when they work with a koan. But compared to other language generation approaches, it seems like a better approximation, so it’s the class of model I set out to train.
My first attempt followed the Diffusion LM paper. The idea was to embed text as vectors in a continuous space, add noise, and train the model to denoise. So basically image diffusion, but for language.
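In code, the core training step looks roughly like the sketch below. The toy denoiser and sizes are placeholders (Diffusion LM uses a transformer and a proper noise schedule), but the noise-then-reconstruct loop is the part that matters:

```python
import torch
import torch.nn.functional as F

# Placeholder sizes and a toy denoiser; Diffusion LM uses a transformer here.
vocab_size, dim, T = 8000, 128, 1000
embed = torch.nn.Embedding(vocab_size, dim)
denoiser = torch.nn.Sequential(
    torch.nn.Linear(dim + 1, 256), torch.nn.ReLU(), torch.nn.Linear(256, dim)
)
# Crude stand-in for a noise schedule: cumulative signal level shrinking toward zero.
alphas_cumprod = torch.linspace(0.999, 0.001, T)

def training_step(token_ids):
    x0 = embed(token_ids)                          # clean embeddings, shape (B, L, D)
    t = torch.randint(0, T, (token_ids.size(0),))  # random timestep per example
    a = alphas_cumprod[t].view(-1, 1, 1)
    noise = torch.randn_like(x0)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise    # forward (noising) process
    t_feat = (t.float() / T).view(-1, 1, 1).expand(-1, x0.size(1), 1)
    pred_x0 = denoiser(torch.cat([xt, t_feat], dim=-1))
    return F.mse_loss(pred_x0, x0)                 # learn to recover the clean embeddings

loss = training_step(torch.randint(0, vocab_size, (4, 16)))  # toy batch of token ids
loss.backward()
```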
I spent a week doing training runs. At one point I even tried fine-tuning an existing diffusion language model. In the end, I got models that mostly refused to generate anything, and when they did, they produced long strings of punctuation.
The problem was that 20k examples was way too small to train continuous diffusion from scratch. Even with augmentation tricks that pushed the training set closer to 40k, it wasn’t enough. The original Diffusion LM work required a large dataset and converged very slowly, while I was doing training runs on my MacBook with a small corpus. It was a good idea, but I didn’t have the data to make it work.
If I learned anything from my CS degree, it’s that discrete is better than continuous, so I tried discrete diffusion next, specifically masked discrete diffusion (MDLM). The way this works is to train the model on examples that have some of their tokens masked, like “The [mask] of [mask] is [mask] [mask]”. The model learns to predict what token is behind each mask, and at generation time it starts from a string of masks of the desired maximum length and iteratively unmasks them to reveal the output.
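To make that concrete, here’s a simplified sketch of one common sampling scheme: reveal the positions the model is most confident about, a few at a time. The stand-in model below just returns random logits; it shows the shape of the loop rather than my actual MDLM code:

```python
import torch

MASK_ID = 0   # placeholder id for the [mask] token
VOCAB = 100   # placeholder vocabulary size

def model(seq):
    # Stand-in for a trained masked-diffusion model: token ids -> per-position logits.
    return torch.randn(seq.size(0), seq.size(1), VOCAB)

@torch.no_grad()
def generate(max_len=16, steps=8):
    seq = torch.full((1, max_len), MASK_ID, dtype=torch.long)  # start fully masked
    for step in range(steps):
        probs = model(seq).softmax(-1)
        conf, pred = probs.max(-1)                 # best guess and confidence per slot
        masked = seq == MASK_ID
        # Reveal a fraction of the remaining masks each step, highest confidence first.
        n = max(1, int(masked.sum()) // (steps - step))
        conf = conf.masked_fill(~masked, -1.0)
        idx = conf.topk(n, dim=-1).indices
        seq[0, idx[0]] = pred[0, idx[0]]
        if not (seq == MASK_ID).any():
            break
    return seq

print(generate())
```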
This worked better. The model produced actual words instead of just punctuation. But it fixated on common tokens. Lots of “the the the...” and similar repetitions.
I thought this approach had promise, so I next tried bootstrapping with GPT-2’s pretrained embeddings, hoping to give the model a head start. It helped a little. Now instead of repeating “the,” it would repeat meaningful words, like “the mind mind mind mind...”. Sadly, still not good enough.
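The bootstrapping itself is simple: copy GPT-2’s pretrained token embedding table into the model’s own embedding layer. The snippet below shows the idea, with a bare embedding layer standing in for my model’s:

```python
import torch
from transformers import GPT2Model

gpt2 = GPT2Model.from_pretrained("gpt2")
pretrained = gpt2.wte.weight.detach().clone()   # (50257, 768) token embeddings

# Hypothetical stand-in for the diffusion model's embedding table, built to match
# GPT-2's vocabulary and hidden size so the weights can be copied straight in.
token_embedding = torch.nn.Embedding(*pretrained.shape)
token_embedding.weight.data.copy_(pretrained)
```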
At this point I decided to give up on diffusion. Maybe there’s a way to make it work, but the evidence was rolling in that my dataset was too small and existing diffusion text models were too primitive for me to fine-tune on it. It’s possible that, given enough compute, I could have fine-tuned a large diffusion language model, but that was going to require hundreds of hours of runtime on my laptop, so I started exploring other options.
Masked diffusion was showing promise, even if it wasn’t producing the results I needed. Maybe the problem wasn’t masking, but diffusion. What if I tried a model that’s really good at masked prediction?
BART is a transformer model trained by corrupting text and learning to reconstruct it. Unlike GPT, which predicts the next token, BART can predict missing tokens anywhere in the sequence because it denoises. That’s similar in spirit to what MDLM does, but with a different model architecture (encoder-decoder).
It was clear I didn’t have enough data to train a BART-style model from scratch, so I tried both full fine-tuning and LoRA, which trains small low-rank adapter matrices alongside the frozen model weights rather than updating the weights directly. The results were better than MDLM with GPT-2 embeddings, but still not useful. The model wasn’t quite as repetitive, but the responses still weren’t coherent, producing output like “the mind of Buddha is the mind of Buddha is the mind of Buddha is the mind of Buddha.”
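A LoRA setup with the Hugging Face peft library looks roughly like the sketch below. The checkpoint and adapter hyperparameters here are illustrative, not a record of my exact configuration:

```python
from transformers import BartForConditionalGeneration
from peft import LoraConfig, TaskType, get_peft_model

# "facebook/bart-base" is an assumption; any BART checkpoint works the same way.
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# LoRA: freeze the pretrained weights and train small low-rank adapters
# on the attention projections instead.
config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a tiny fraction of the weights train
```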
Maybe the issue was that denoising, while it could work in theory, wasn’t the right approach for my open-ended Q&A task. Maybe what I needed was to try a model specifically trained on Q&A and then fine-tune it. So that’s what I did next.
Flan-T5 is a family of encoder-decoder language models, instruction-tuned across a broad mix of tasks, and it’s frequently used as a starting point for models that answer questions. Keeping in mind I’m doing all the training on my MacBook, I tried Flan-T5-small, which weighs in at just 77 million parameters.
It worked almost immediately.
It took about two hours of fine-tuning against my full dataset to produce the Kaku-Ora model. No fancy tricks, just simple, straightforward fine-tuning.
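For the curious, fine-tuning Flan-T5-small with Hugging Face’s Seq2SeqTrainer looks roughly like the sketch below. The file name and field names carry over from the earlier JSONL sketch, and the hyperparameters are illustrative rather than the exact settings behind the deployed model:

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Assumed file and field names, matching the JSONL sketch earlier.
data = load_dataset("json", data_files="koans.jsonl")["train"]
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

def preprocess(batch):
    inputs = tokenizer(batch["question"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["response"], truncation=True, max_length=128)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = data.map(preprocess, batched=True, remove_columns=data.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="kaku-ora",
                                  num_train_epochs=3,
                                  per_device_train_batch_size=8,
                                  learning_rate=3e-4),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```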
Could I get better output with a bigger model? Probably. Maybe at some point I’ll go back and try it. But the quality of the output was already sufficient that I was willing to call this experiment a success. You can see the kind of output it produces by trying Kaku-Ora for yourself.
If you’ve used the I Ching or read a koan collection, you know that they contain more than hexagrams and koans. They also contain analysis. In koan collections, this typically takes the form of a capping verse—a short poem that elucidates the essential matter addressed by the koan.
I wanted that for my oracle. I was starting to envision deployment. A small website. You enter a question. The oracle contemplates. Then you get its response, and are also supplied a verse to help you make sense of it.
I initially used Claude Haiku for the verses. The quality was great. Haiku understood the genre and produced genuinely poetic commentary. But running Haiku costs money, even if not a lot of money, and I didn’t want to worry about abuse if I made the oracle publicly available.
So I switched to Qwen2.5-1.5B-Instruct, which runs for free on HuggingFace. The verses are worse. If I’d cared to invest more time in making sure nobody could spam the oracle and run up my Anthropic bill, I’d have used Haiku.
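Generating a verse is then just a chat-style call against the hosted model. The sketch below uses the huggingface_hub InferenceClient; the prompt wording is illustrative, not the exact prompt the site uses:

```python
from huggingface_hub import InferenceClient

# Assumes the serverless Inference API is serving this model (a token may be required).
client = InferenceClient(model="Qwen/Qwen2.5-1.5B-Instruct")

def capping_verse(question: str, oracle_response: str) -> str:
    messages = [
        {"role": "system",
         "content": "You write short capping verses in the style of Zen koan collections."},
        {"role": "user",
         "content": f"Question: {question}\nResponse: {oracle_response}\n"
                    "Write a four-line capping verse that illuminates the response."},
    ]
    out = client.chat_completion(messages, max_tokens=120, temperature=0.8)
    return out.choices[0].message.content
```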
With the model trained, how do I get it out into the world? The easy option was to run it on HuggingFace, since I have no need to keep the model weights secret, and running there on CPU is free. Good enough.
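A free CPU Space serving a model like this can be as small as a single Gradio file, roughly like the sketch below. The model id is a placeholder, and the real site adds more presentation on top:

```python
import gradio as gr
from transformers import pipeline

# Placeholder model id, not Kaku-Ora's actual repository name; device=-1 forces CPU.
oracle = pipeline("text2text-generation", model="your-username/kaku-ora", device=-1)

def ask(question: str) -> str:
    out = oracle(question, max_new_tokens=64, do_sample=True, temperature=0.9)
    return out[0]["generated_text"]

gr.Interface(fn=ask, inputs="text", outputs="text", title="Kaku-Ora").launch()
```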
I then worked with Claude to vibe code a site. Much of the value of an oracle is in your interactions with it, so it was worth putting some effort into presentation.
Much of the point of training this model was to get more hands-on experience training AI. The trouble has been that, at work, RAG + hosted LLMs are more than enough for most tasks, and while that’s great for getting the job done, it doesn’t help me learn the fundamentals.
Some things I really know now, having trained Kaku-Ora:
Data matters a lot.
The right model architecture also matters a lot.
Insufficient scale is a problem, but also a chance to get creative.
Getting a model to work at all is hard.
Refining it is equally hard (but in a different way).
Diffusion is hard because it’s slow to converge, even with good data.
I also came away with a sense that, if I wanted to make Kaku-Ora better, I could probably do it by investing more effort in a few areas:
data cleaning
more data
training on larger models
Where does Kaku-Ora go from here? I don’t know! I’m thinking about working on evals to see if I can assess how well a model understands koans. I did a small experiment in this direction once, and the results were underwhelming but promising. Seems worth trying again, and in a more formal way.
Until then, Kaku-Ora is live! Ask it a question. See what comes back.
The code is on GitHub, and the model runs on HuggingFace Spaces.