Text Compression Can Help Secure Model Weights

Roy Rinberg

Associated Paper: link^[1], code

Paper Co-authors: Roy Rinberg, Annabelle Michael Carrell, Simon Henniger, Nicholas Carlini, Keri Warr

Live Demo webpage: link

Two years ago, Ryan Greenblatt wrote a post titled "Preventing model exfiltration with upload limits". The idea of the original post is simple: if there's a single wire connecting to a GPU that allows egress of 100 GB per day, then even if an adversary has full control of the GPU, there's nothing they can do to exfiltrate more than 100 GB in a single day.

The main challenge is that you can't set your egress limit lower than the amount of data you need to send to your customers. To contextualize this with some rough approximations: my estimate is that OpenAI produces about 1 TB of text per day, and a frontier model is approximately 1 TB^[2]. So egress limiting alone buys us about 1 day before an adversary could steal the weights.

With a good monitoring and incident-response system in place, an attacker is incentivized to exfiltrate as much as possible while staying under the detection threshold. Any exfiltration attempt is forced to tradeoff stealthiness against exfiltration-volume: the more data you try to steal, the more likely you are to get caught.

The notable exception to this is that if the attacker can steal the entire model in a single operation, it doesn't matter if they're caught as by the time the defenders can respond, the weights are already gone. Egress limiting serves as a powerful backstop: even if an adversary has fully compromised the system, they cannot exfiltrate faster than the physical limit allows. Egress-limiting can be thought of as buying time for detection-and-response systems to kick in.^[3]

Greenblatt's post builds on this design with a simple observation: text is fairly compressible, and intrinsically more compressible than model weights^[4]. Thus, we can compress text before it leaves the GPU, send the compressed version over the wire, and decompress on the other side. This shrinks the bandwidth consumed by legitimate usage, letting us set a tighter egress limit.

We recently wrote a paper, “Haiku to Opus in Just 10 Bits: LLMs Unlock Massive Compression Gains” where we empirically investigate how compressible LLM outputs are and develop novel compression techniques for lossless and lossy compression. In this blog and the associated paper, we look at things that can be implemented today, as well as more future-looking techniques; namely we have 3 directions we explore:

Lossless compression: using LoRAs to improve the compression ratios of LLM-based arithmetic coding.
Lossy rewriting: a helpful baseline where we run compression after simply asking the model to rewrite its own response in a compressed format.
Question-Asking (QA): we further introduce Question-Asking compression (QA), an interactive protocol for lossy compression, inspired by the game “Twenty Questions”, where a small model iteratively refines its response by asking yes/no questions to a stronger model, transferring exactly one bit per answer.

Throughout this post we will touch back to what these different levels of compression imply for worst-case time to exfiltration. We will assume that a lab is producing approximately 1 TB of text per day, and a model is approximately 1 TB, assuming even-usage, we can approximate this as egress limiting would enable 1 day of worst-case exfiltration-prevention.

In a related regime, one can easily imagine the need for security for an on-premise deployment of an open-source model, where the model is public, but it has private LoRA. A BOTEC calculation, based on aggregating from ~20 of my peers working on AI safety, API calls and agents (e.g. Claude Code) produce 5GB of egress per week and approximately 20 times more than that in ingress. A single LoRA can comfortably be in the range 10MB to 200MB. In this setting, the team is using around 0.7 GB per day, this would be on the range of taking 1/70th to 1/3th of a day to exfiltrate (between 20 minutes, and 8 hours).

For the rest of this post, we put things in context of two numbers:

Worst-case exfiltration prevention of a model (1TB) with 1TB of normal user traffic egress per day: 1 day
Worst-case exfiltration prevention of a 100MB LoRA, with 1 GB of egress per day: 2.4 hours

Note: These numbers are not meant to be exact, and if you have better estimates, please reach out or comment. Additionally, one of the challenges of talking about compression is that the level of compression does depend heavily on what you are compressing. For each of the compression ratios stated before, we will put it in context of what is being compressed.

Lossless Compression with LLMs

Arithmetic coding is a compression scheme that can be shown to achieve a (nearly) optimal compression ratio when the probability distribution used to compress matches the true distribution of the data. The better your probability estimates, the fewer bits you need. LLMs produce next-token probability distributions as their core operation, so there's a natural connection: feed text through the model token by token, and use each next-token distribution to encode the corresponding token via arithmetic coding.

Prior work (TS-zip, LLM-Zip, Language Modelling is Compression) has shown that LLM-based arithmetic coding achieves roughly a 10% compression ratio on benchmarks like Wikipedia, compared to about 33% for gzip. Compression quality depends heavily on how well the model's distribution matches the text being compressed. A general-purpose LLM will compress Wikipedia reasonably well, as it was likely trained directly on Wikipedia. However, it’s worth noting that this ratio is highly dependent on the actual content of what is being compressed.

We address this with domain-adapted LoRA adapters. LoRAs let you fine-tune a lightweight adapter on top of a base model, specializing it for a particular domain at minimal cost. We train a range of LoRAs on various topics (e.g. a medical LoRA, a math LoRA, etc.). We evaluate this approach on various partitions (created through unsupervised clustering) of LMSYS-Chat and WildChat, two large-scale datasets of real user-chatbot conversations, and we find that domain-adapted LoRAs reduce the compression ratio from 0.18 (base Mistral-7B) to 0.09, a 2x improvement.
Even using a RAG-based router that selects the appropriate LoRA from the *prompt alone* (before seeing the response), we still see a ~1.9x improvement (0.10 compression ratio).

To put this into a security framework, if the text you seek to compress looks like LMSYS, then using arithmetic coding at all, buys you ~5x improvement in time, and then using a LoRA doubles it.

Worst-case exfiltration prevention of a model (1TB) with 1TB of normal user egress per day: 5 days→10 days
Worst-case exfiltration prevention of a 100MB LoRA, with 1 GB per day of egress traffic: 0.5 days →1 day

Lossy Compression

So far, we have discussed lossless compression, where the decompressed text is identical to the original. However, in most LLM chatbot interactions, one does not require reproducing a specific sequence of tokens; the chatbot only needs to return a “useful” response.

One straightforward approach possible here, is simply to ask the model to rewrite its own response "as succinctly as possible while preserving enough information to infer the correct answer." Empirically, on responses from AIME and MBPP (a math and coding dataset respectively) this method achieves a compression ratio of about 0.03, which is roughly 2x better than lossless compression alone. We also tested a best-of-N baseline (generate N responses, keep the most compressible one), which only yields ~30% improvement, far less than “just asking” for compression.

To put this into a security framework, if the text you seek to compress looks like math/coding benchmarks, then using arithmetic coding at all, buys you ~10x improvement in time, and then asking a model to rephrase approximately doubles it.

Worst-case exfiltration prevention of a model (1TB) with 1TB of normal user traffic per day: 10 days → 20 days
Worst-case exfiltration prevention of a 100MB LoRA, with 1 GB per day of egress traffic: 1 days→ 2 days

Question-Asking Compression (QA)

We also introduce a more extreme form of compression, inspired by the game 20 Questions. A small language model (SLM) (e.g., Haiku) attempts the problem on its own, then asks the large language model (LLM) (e.g., Opus) a series of yes/no questions (e.g., "Is my approach to the integral correct?", "Should I use dynamic programming instead?"). Then, after 10 questions, the small model revises its answer. As each answer is exactly 1 bit, we show that it is possible to compress concepts in less than 10 bits.

Excitingly, the questions do not actually need to be sent between from the SLM to the LLM, because given the prompt the LLM can exactly simulate the the SLM's questions locally. As such, only the prompt (ingress) and the LLM's answers (egress) ever need to be sent over the bandwidth limited barrier.

Primarily, we investigate the setting QA, where Haiku asks questions, Opus answers them, then Haiku revises its original question.

While we see impressive gains, we find that this scheme does not work for 100% of all questions. There are many possible reasons for what could be going on when the QA-compression scheme fails to recover the right answer:

The scheme may be a poor scheme
It is also possible that the scheme would generally work, but Haiku is not sufficiently good at asking questions, and in future iterations of this, the scheme would work well.

To account for this second reason, we include an ablation into the value of question-asking itself, where Haiku generates a response, and then we call Opus to ask questions that would help it improve, then a different instance of Opus answers them, we call this method QA+ (while this is not a realistic setting it still gives us insight into ways this scheme can be used in the future).

Further, we observe that this entire process is a constrained form of Chain-of-Thought, and in some ways we are giving the small model more test-time compute to answer the question. As a baseline, we also run an experiment where Haiku asks questions and answers them itself, we call this Bit-Limited Chain-of-Thought (BL-CoT).

We evaluate on math (GSM8K, MATH, AIME), science (GPQA), code (MBPP), and frontier challenges (Humanity's Last Exam), and we measure "recovery rate": the fraction of problems where Haiku gets the wrong answer on its own, but gets the right answer after the protocol.

We find that this method recovers 23-72% of the gap between Haiku and Opus on easier datasets, and 7-38% of the gap on harder datasets. While effectively achieving a compression ratios between 0.004 and 0.0006 (when it works), depending on the dataset^[5].

Some high-level observations we make is that the QA protocol does work, but not uniformly. On easier benchmarks (GSM8K, MATH, GPQA), most of the recovery comes from the structure of asking questions rather than the frontier model's answers; BL-CoT alone performs nearly as well as QA. On harder benchmarks (MBPP, HLE), the frontier model's answers matter more, and QA+ yields further gains^[6].

Critically, this scheme is not ready to be put into production - it does not work 100% of the time, but only successfully compresses some responses. However, if it worked deterministically, then if the text you seek to compress looks like math/coding benchmarks, then using a scheme like this could extend worst case egress to:

Worst-case exfiltration prevention of a model (1TB) with 1TB of normal user traffic per day: 250 days
Worst-case exfiltration prevention of a 100MB LoRA, with 1 GB per day of egress traffic: 25 days

Lastly, it’s important to note that these compression schemes are not a safety mechanism for preventing a user from making a dangerous query, they are only a security mechanism for enabling extreme amounts of egress limiting.

Future Directions

I believe this work opens the door for some new pretty exciting opportunities, both in improving the scheme as well as imagining new paradigms for security.

Potential Technical Improvements to be Made to QA compression

Improving Question Asking. Empirically, we saw that QA+ does better than QA; thus, it seems likely that improving the model's ability to ask questions will be extremely helpful. This is likely possible simply using better prompting.

Reliability. At the moment, QA meaningfully improves the small model's response only some percentage of the time. For this method to be useful in practice, the method needs to be made more robustly reliable — it seems significantly more valuable to reliably produce a smaller compression ratio than to occasionally produce a large compression ratio. However, using Arithmetic coding can always be used to back-stop the QA compression scheme, as a "worst case" compression if the private (large) LLM detects the public (small) LLM cannot recover the answer.

LLM-initiated information injection. Our protocol relies entirely on the small model to ask the right questions, but some problems don't fit this format. If the small model lacks a key piece of knowledge, no sequence of binary questions will surface it (e.g., "what is the square root of 17 (to 6 decimal points)": if you don't know how to compute this, it will be very hard to ask enough binary questions to figure it out). Allowing the frontier model to inject information ("4.123105" or "use Newton's method") could help a lot. This is very related to the work in Adversarial Compression Ratios (ACR) . (Note: while I wasn't able to make an extension of ACRs work, I'm confident there's a lot to explore here).

A possible future: Recursively Asking the Wisest Monk

The QA-compression scheme described in this work considers a single capable model interacting with a single weaker one. But the idea naturally extends to a hierarchical setting.

Imagine a sequence of models arranged in tiers, where each inner tier contains a more capable model and is protected by stricter egress limits (akin to top-secret protocols for accessing sensitive documents). Communication between tiers occurs through QA-style compression, where the most capable model sits at the center, transmitting information outward only through increasingly bandwidth-constrained intermediaries.

As diagrammed in the figure below:

In the lossless regime, each tier transmits the same underlying information, but encodes it using its own model. As we move outward (left-ward in the diagram), weaker models require more bits to represent the same content. Consequently, outer tiers consume more bandwidth, while the innermost model requires the least egress. In theory this scheme could be implemented today.

In the lossy regime, the situation is more complex. As information moves outward, lower-tier models may reinterpret or simplify it. In some cases, this may make the content more compressible, while in others, weaker models may require substantially more interaction to resolve complex concepts. I find it hard to predict how this would play out.

However, it's worth noting that this is a valuable method if the dangerousness of a model scales proportionally with the capabilities of the model; however, if all models beyond a certain capability are considered maximally dangerous, then this style of hierarchical egress limiting is potentially not that valuable.

In the lossy-compression regime, this scheme, is not directly implementable today, and several things need to be improved on before we can try to build this. In particular, any lossy compression scheme must come with strong guarantees about failure modes. A probabilistic or silent failure is unacceptable in most sensitive settings, and so at the very least, errors must be detectable.

Philosophizing

Thinking about QA compression has forced me to take a new perspective on how LLMs really do open up new forms of interaction and frameworks on the world. Most excitingly this scheme has made me think three semi-novel ideas about science, and what I used to consider science-fiction.

From a purely philosophical perspective, this question-asking framework offers a lens on the scientific process itself. When a chemist wants to synthesize a new chemical, they form a hypothesis, run an experiment, and get back a low-bit answer: it worked, or it didn’t. Then, if the experiment fails, the scientist returns to the problem with a new question: “Was the temperature wrong?” “Did I use the wrong catalyst?” In this light, each experiment is a yes/no question posed to nature. This maps onto QA-compression scheme: the scientist is the small, less-capable model, which formulates candidate solutions and asks binary questions; and nature is the LLM providing yes/no feedback. This framing and this paper provides quantifiable support to the pedagogical adage that the hard part of science is asking a good question.

Next, I find it profound that this Question Asking framework can be used as a means to measure the relative information between two intelligent beings on a specific topic. Relative information is both technically and philosophically challenging to measure -- how much more information does Geoffrey Hinton know than me on the topic of “Neural Networks”, what about “Cooking an Omelette"? The Question-Asking framework provides structure for beginning to answer such profound questions, with relatively tight bounds. I’ll note that this is similar in flavor to recent work on “Adversarial Compression Ratios”, which measures how memorized text is by an LLM, with the minimum number of tokens needed to prompt that exact generation of text.

Lastly, and most science-fiction-esque, I now think that LLMs really will introduce a novel form of extremely low-bandwidth communication. Two parties can simulate the same model, then to communicate a complex thought they only need the “difference” between their shared representation. Very complex ideas will be able to be sent through incredibly low-bandwidth channels. While this last idea has existed in some writings before, I have not seen it operationalized into practice.

Ending Note

I think text compression has a meaningful role to play in making egress-limiting an even more meaningful defense for preventing model weight exfiltration. Please feel encouraged to disagree with me. And if you're interested in doing research relating to this, or development work to implement similar schemes, please reach out.

Acknowledgements

Many thanks to my co-authors on this work, and special thanks to Maria Kostylew for comments on this blog post. Thank you to Coefficient Giving, MATS, Anthropic, and Harvard for compute credits, GPU usage, and funding.

Appendix: Analysis and Examples of QA Compression Transcripts

Looking at the transcripts, the small model generates yes/no verification questions that walk through its own solution step by step. Interestingly, 31% of recovered problems had every answer confirmed as "yes'' by the frontier model; the structured self-verification alone was enough to fix the error. For example, on a GSM8K tax problem, Haiku asks questions like "Does the solution correctly apply the 10% tax rate only to the nonfood subtotal?'' and "Are all five items accounted for with their correct prices''; the frontier model confirms each one, and Haiku revises its answer anyway.

In another 37% of recoveries, the small model asks mostly confirming questions but surfaces one or two key misconceptions. On a GSM8K word problem, Haiku asks ``Does 10 times more than the fourth friend' mean adding 10 times the fourth friend's presses?'', and the frontier model answers "no", pinpointing the exact misinterpretation; the other four questions are all confirmed. This pattern, where the small model's approach is mostly sound and just needs a targeted correction, accounts for the majority of successful recoveries.

When QA fails, the transcripts reveal two characteristic failure modes. In one, the small model verifies all the steps of a wrong solution path. For example, on a MATH algebra problem, Haiku carefully checks each step of its factorization and sign analysis, the frontier model confirms them all, but the fundamental setup is wrong and the questions never probe it. In the other failure mode, the small model is too confused to ask useful questions at all. This is particularly stark on code benchmarks (MBPP), where the small model asks about edge cases, type hints, and error handling (``Should the function validate that the radius is non-negative?'', ``Would it be beneficial to add type hints?'') rather than probing its actual bug, however the corrections do not help because they are orthogonal to the real issue.

Some examples of transcripts are laid out in Appendix and below. We upload a collection of transcripts to Hugging Face.

^{^}
This would ideally be on Arxiv, but has been on-hold for several weeks now; once it is uploaded, I will update the link.
OpenRouter's State of AI reports approximately 7 trillion tokens per week in November 2025, or about 1 trillion tokens per day. At 2 bytes per token, that's roughly 500 GB. If approximately half are output tokens (likely fewer), OpenRouter services about 250 GB of output tokens per day across all models. A rough guess is that OpenAI produces ~4x more tokens than OpenRouter, giving us ~1 TB. ↩︎
It’s worth noting that Anthropic has rolled out egress limiting as part of its defenses: https://www.anthropic.com/news/activating-asl3-protections. ↩︎
Though, people do try to compress model weights; lossless: https://arxiv.org/abs/2504.11651, lossy: https://arxiv.org/abs/2601.01296. ↩︎
To put that in perspective: a typical Opus response to an AIME problem is about 1,083 tokens, or roughly 18,000 bits, whereas this would be compressing the response in 10 bits. ↩︎
We also tested scaling from 10 to 100 questions, and the gains were small and inconsistent. The bottleneck appears to be question *quality*, not quantity. And we replicated these experiments with GPT-OSS-120B as the small model and saw similar patterns. ↩︎

Another advantage of your method that might actually convince capabilities-focused researchers to use it: it could be more efficient! The strong LLM does all the heavy lifting, outputting as few tokens as possible. The weak LLM does the easy stuff like coming up with a first draft and writing the final response.

My estimates on text traffic and model weight sizes:

Text traffic

OpenAI was doing ~8.64 trillion tokens per day in October 2025 on the API (not including ChatGPT)

Google went from processing ~500 trillion tokens per month in May to ~1 quadrillion tokens per month in July.

It's uncertain how many of these are output tokens. Based on OpenRouter's numbers, it seems like a plausible estimate is 10%, or 100 trillion output tokens per month (back in July). This is 200 TB per month, or 7 TB per day. I'm not sure how many data centers they have - if there's 10 data centers, then it would be 700GB per day.

It has been 8 months since then, so a naive extrapolation suggests this is now ~16x larger.

For model weight sizes, Grok-3 and Grok-4 were 3 trillion parameters per Elon. This would be ~1.5-3TB, depending on if they use 4 or 8 bit datatypes. Grok-5 will be 6 trillion parameters, also per Elon.

Anthropic's Feb 2026 risk report discussed securing "multi-terabyte model weights".

So a reasonable estimate is 1-10TB of output tokens per datacenter per day, and 1-10TB model weight size.

Very helpful data points and BOTEC. I largely agree with your estimates. Thank you

OpenAI produces about 1 TB of text per day, and a frontier model is approximately 1 TB. So egress limiting alone buys us about 1 day before an adversary could steal the weights.

Wouldn't someone notice if 100% of requests to OpenAI failed for 24 hours straight though?

It seems like the amount of egress an attacker could use without getting caught is a function of buffers (how much higher does the egress limit need to be to handle spikes of traffic) and how long they can get away with an obvious attack, not just the total amount of bandwidth that could be used if the network wasn't doing anything else and/or no one was paying attention.

Wouldn't someone notice if 100% of requests to OpenAI failed for 24 hours straight though?

Yes, definitely. A few responses:

Egress-limiting sets up a information-theoretic backstop, there is simply no way to steal more than X GB in a day across a channel that allows for X GB per day. The more time this backstop buys us, the more leeway we have with our monitoring + incident response systems.
I don't think it's likely that 100% of the traffic would be corrupted, but I don't know what is the most likely world, and it's easy to linearly scale based on what security people think is the most likely outcome. (e.g. 10% of the datacenter -> 10x increase timelines).
There's a cognitive load benefit for security engineering - it's extremely convenient to think about egress-limiting because it's information-theoretic bound. You should be able to easily confirm that the egress-limiting is implemented properly because it should be <200 lines of code (or is literally just the width of a wire). And "complexity is the worst enemy of security" (post by Bruce Schneier).

To give a somewhat long-winded tangent; I find it useful to think about exfiltration in three settings:

Setting 1: Fast exfiltration. An extremely rapid exfiltration occurs before human intervention is possible. For example, an adversary compromises 10,000 servers and leaks 1/10,000th of the weights from each of them.
Setting 2: Intermediate exfiltration. A leak occurs over days or weeks - that is, a human-operable amount of time.
Setting 3: Slow exfiltration. There is a length of time (e.g., 2 years) where even if the adversary could steal the model weights in that amount of time, it wouldn’t matter by the end, because new models would be meaningfully different.

In this framing, egress-limiting is tasked with moving us from setting 1 to setting 2. And then we can have monitoring + response schemes that move us from setting 2 to setting 3. (Relatedly, you may find it interesting to take a look at my other work on inference verification to detect steganographic exfiltration attempts.)

It seems like the amount of egress an attacker could use without getting caught is a function of buffers

Again, yes I agree. 2 responses:

Yes, we would certainly need to include buffers to these bandwidths, the calculations in the post are a back-of-the-envelope-calculation, and I imagine the amount of buffer required is still in the same order of magnitude. (Plus I don't work at a lab, nor would I be allowed to share such numbers.)
More algorithmically, there are rate-limiting methods that permit bursts above a baseline rate while enforcing a hard cap on the total across a longer period (e.g., 1 day). For one, there's a method called Token Bucket Rate Limiting.

Last, most speculative, point: in the most extreme world where a powerful adversary (e.g. nation state) physically takes over a datacenter. Egress-limiting (if connected to other defenses, like tamper-proof enclosures that destroy the GPU when opened) gives us time to reacquire the GPUs/model-weights, because even with sudo access to the GPU, they still can't exfiltrate weights in less than X-amount-of-time. Though, this is probably wrong in some way, because either nation-states may be able to remove the enclosures and side channel into the GPU some other means, or once the weights are stolen it's unlikely we steal them back.

Interesting! This is very different from the application you describe, but I think Question-Asking could be directly applied to AI control. I'm curious what you think about this.

Scenario: Imagine a high-stakes AI control setting, where there's a strong untrusted model U and a weak trusted model T. U secretly wants to do some catastrophic action, like submitting a PR that contains a hidden backdoor. However, we put in place a Question-Asking protocol, so T is the one doing all of the actions.

I claim it's extremely hard for U to make T insert a backdoor just by answering its yes-or-no questions. As you show in your paper, we recover a pretty good chunk of U's performance. So in the terminology of AI control, we get a point on the Pareto frontier between "safety" and "usefulness" with very high safety and medium usefulness.

Yeah - I agree.

I claim it's extremely hard for U to make T insert a backdoor just by answering its yes-or-no questions.

Seems very likely. I think you don't need to be as restrictive though. I think forcing U and T to have to communicate through any kind of information-bottleneck (not just binary QA).

For example, the dual-LLM setting is used in CAMEL to defend against prompt-injection. There they have a 2nd LLM that monitors the actions of the 1st LLM, the 2nd LLM only sees meta-data that summarizes the interaction the 1st LLM is trying to do (e.g. "agent X is requesting to send data type Y to location Z". This restriction of the types of things the 2nd LLM sees makes it difficult to prompt inject the 2nd LLM.

I think safety through information-bottlenecks is a good direction. While I don't personally know people working in this direction, I think this is something people are exploring. I'd be curious to hear what you or others find.

I am currently working on this with Redwood in the Astra fellowship! My project is about exploring techniques with very high safety.

One of the things I've tried so far is "best-of-N," where we sample an action from the trusted model N times and have the untrusted model pick its favorite. But I suspect QA is strictly better, since T gets to exercise some intelligence in how it explores the search space, and the number of bits elicited from U w.r.t. the amount of text T writes is O(n) rather than O(2^n).

I agree we could make it less restrictive than simply binary questions. We could give T a budget for the number of bits it elicits from U, and let it specify the answer space it wants - we could default to "Yes|No", but give it flexibility in exactly what answer choices to allow U. I found a cool Python library called exrex that can count the number of strings allowed by a finite regex expression! It's easy to count bits this way:

import exrex, math

bits = lambda r: math.log2(exrex.count(r))

bits(r"Yes|No")             # 1.0 (2 possibilities)
bits(r"Yes|No|Maybe")       # 1.5849... (3 possibilities)
bits(r"[a-zA-Z ,.!?]{100}") # 588.2643... (59^100 possibilities)

Another improvement would be to let T ask "follow-up questions" in a conversation with U. U's initial answer might lead T down a new path, leading to questions T hadn't previously considered.

Possibly a better alternative to the regex idea: bits = the log-likelihood of U’s answer to the question under T's probability distribution. This lets U give any free-form response, and only parts that T finds “surprising” are considered when counting bits.

Interestingly, even if every question from T is a yes-or-no question, the above method would let U answer them in fewer bits on average! For example, if T already thinks something is 90% likely, and U confirms that the answer is indeed "Yes", that provides T with only -log2(0.90) = 0.152 bits of information.

The simplest way to do this may be to provide U to T as a tool. We can give T a "bit budget," which it can allocate across various calls it makes to U. The inputs to the tool are T's question to U + a bit budget for this question. If U's answer exceeds the bit budget, we simply cut off its response at the last token before the budget was exceeded. The outputs of the tool are U's answer + "X bits used, Y bits remaining in your budget."

Out-of-the-box, T is probably quite suboptimal at asking questions that efficiently narrow the search space. Also, U might not always answer questions in a way that sounds like T, wasting bits. To address this, we can optimize the system prompt of both T and U, improving their skills at using their bits efficiently when asking and answering questions. We can even safely use U to help find a good system prompt to improve T's question-asking; as long as we check the prompt for suspicious instructions, it should be difficult for the U-written prompt to cause T to take catastrophic actions.

Another way to think about this is that whenever T asks U a question, it's offloading part of its chain-of-thought to U. One pretty common pattern in reasoning model CoTs is:

Come up with an initial response
Ask itself how it could improve the response
Decide on answers to those questions
Write the final response

This is exactly the format used in your paper, where you offload part 3 of the "CoT" to the strong LLM. But given arbitrary tool calls to the strong LLM, a better protocol is surely possible.

I'd be curious to see what comes from your experiments.

I also think that there's a different in the control setting in what the trade-offs are between having the small-model vs large-model "initiate". In the compression setting the small model asks a lot of questions, and this lets you not have to send the questions "over the wire", with control, you're likely fine with bi-directional flow: the small model asks for help me when it needs, or the large model interjecs when it wants.
I haven't thought too carefully on that.

Interestingly, even if every question from T is a yes-or-no question, the above method would let U answer them in fewer bits on average! For example, if T already thinks something is 90% likely, and U confirms that the answer is indeed "Yes", that provides T with only -log2(0.90) = 0.152 bits of information.

Yeah - definitely! You can use arithmetic coding to encode those bits, according to the small model's priors. And if it turns out to be less helpful than sending the bits directly, the large model woudl be able to assess that. So you never need to send more than N bits for N questions, but could be able to do much better. I like this idea a good deal, we do mention this in the paper, but it's not a major point.

I think there are a lot of specifics to explore in the implementation of the interaction, and our work explores relatively constrained space (we did try a few other LLM-SLM interactions, and mostly got less impressive results).

One of the follow up works that are needed to make something like this valuable is more analysis on what makes 2 summaries "equivalent".

For example, this work Asking and Answering Questions to Evaluate the Factual Consistency of Summaries , does:

LLM 1 generates a response R and a summary S
LLM 2 comes up with a series of (binary) questions about that response (Q)
LLM 3 is given response R, and then asked to answer questions Q (call this A)
LLM 3 is given summary S, and then asked to answer questions Q (call this A')

Compare A and A', and if they are the same, the summary S contains all the meaningful info from response R.

I think this is a good direction for more exploration

(note, LLM1/2/3 can all be the same LLM, just with fresh contexts)

My estimates on text traffic and model weight sizes:

Text traffic

OpenAI was doing ~8.64 trillion tokens per day in October 2025 on the API (not including ChatGPT)

Google went from processing ~500 trillion tokens per month in May to ~1 quadrillion tokens per month in July.

It has been 8 months since then, so a naive extrapolation suggests this is now ~16x larger.

Anthropic's Feb 2026 risk report discussed securing "multi-terabyte model weights".

So a reasonable estimate is 1-10TB of output tokens per datacenter per day, and 1-10TB model weight size.

Very helpful data points and BOTEC. I largely agree with your estimates. Thank you

OpenAI produces about 1 TB of text per day, and a frontier model is approximately 1 TB. So egress limiting alone buys us about 1 day before an adversary could steal the weights.

Wouldn't someone notice if 100% of requests to OpenAI failed for 24 hours straight though?

Wouldn't someone notice if 100% of requests to OpenAI failed for 24 hours straight though?

Yes, definitely. A few responses:

Egress-limiting sets up a information-theoretic backstop, there is simply no way to steal more than X GB in a day across a channel that allows for X GB per day. The more time this backstop buys us, the more leeway we have with our monitoring + incident response systems.
I don't think it's likely that 100% of the traffic would be corrupted, but I don't know what is the most likely world, and it's easy to linearly scale based on what security people think is the most likely outcome. (e.g. 10% of the datacenter -> 10x increase timelines).
There's a cognitive load benefit for security engineering - it's extremely convenient to think about egress-limiting because it's information-theoretic bound. You should be able to easily confirm that the egress-limiting is implemented properly because it should be <200 lines of code (or is literally just the width of a wire). And "complexity is the worst enemy of security" (post by Bruce Schneier).

To give a somewhat long-winded tangent; I find it useful to think about exfiltration in three settings:

Setting 1: Fast exfiltration. An extremely rapid exfiltration occurs before human intervention is possible. For example, an adversary compromises 10,000 servers and leaks 1/10,000th of the weights from each of them.
Setting 2: Intermediate exfiltration. A leak occurs over days or weeks - that is, a human-operable amount of time.
Setting 3: Slow exfiltration. There is a length of time (e.g., 2 years) where even if the adversary could steal the model weights in that amount of time, it wouldn’t matter by the end, because new models would be meaningfully different.

It seems like the amount of egress an attacker could use without getting caught is a function of buffers

Again, yes I agree. 2 responses:

Yes, we would certainly need to include buffers to these bandwidths, the calculations in the post are a back-of-the-envelope-calculation, and I imagine the amount of buffer required is still in the same order of magnitude. (Plus I don't work at a lab, nor would I be allowed to share such numbers.)
More algorithmically, there are rate-limiting methods that permit bursts above a baseline rate while enforcing a hard cap on the total across a longer period (e.g., 1 day). For one, there's a method called Token Bucket Rate Limiting.

Interesting! This is very different from the application you describe, but I think Question-Asking could be directly applied to AI control. I'm curious what you think about this.

Yeah - I agree.

I claim it's extremely hard for U to make T insert a backdoor just by answering its yes-or-no questions.

Seems very likely. I think you don't need to be as restrictive though. I think forcing U and T to have to communicate through any kind of information-bottleneck (not just binary QA).

I am currently working on this with Redwood in the Astra fellowship! My project is about exploring techniques with very high safety.

import exrex, math

bits = lambda r: math.log2(exrex.count(r))

bits(r"Yes|No")             # 1.0 (2 possibilities)
bits(r"Yes|No|Maybe")       # 1.5849... (3 possibilities)
bits(r"[a-zA-Z ,.!?]{100}") # 588.2643... (59^100 possibilities)

Another improvement would be to let T ask "follow-up questions" in a conversation with U. U's initial answer might lead T down a new path, leading to questions T hadn't previously considered.

Another way to think about this is that whenever T asks U a question, it's offloading part of its chain-of-thought to U. One pretty common pattern in reasoning model CoTs is:

Come up with an initial response
Ask itself how it could improve the response
Decide on answers to those questions
Write the final response

This is exactly the format used in your paper, where you offload part 3 of the "CoT" to the strong LLM. But given arbitrary tool calls to the strong LLM, a better protocol is surely possible.

I'd be curious to see what comes from your experiments.

Interestingly, even if every question from T is a yes-or-no question, the above method would let U answer them in fewer bits on average! For example, if T already thinks something is 90% likely, and U confirms that the answer is indeed "Yes", that provides T with only -log2(0.90) = 0.152 bits of information.

One of the follow up works that are needed to make something like this valuable is more analysis on what makes 2 summaries "equivalent".

For example, this work Asking and Answering Questions to Evaluate the Factual Consistency of Summaries , does:

LLM 1 generates a response R and a summary S
LLM 2 comes up with a series of (binary) questions about that response (Q)
LLM 3 is given response R, and then asked to answer questions Q (call this A)
LLM 3 is given summary S, and then asked to answer questions Q (call this A')

Compare A and A', and if they are the same, the summary S contains all the meaningful info from response R.

I think this is a good direction for more exploration

(note, LLM1/2/3 can all be the same LLM, just with fresh contexts)

45

Text Compression Can Help Secure Model Weights

45

Lossless Compression with LLMs

Lossy Compression

Question-Asking Compression (QA)

Future Directions

Potential Technical Improvements to be Made to QA compression

A possible future: Recursively Asking the Wisest Monk

Philosophizing

Ending Note

Acknowledgements

Appendix: Analysis and Examples of QA Compression Transcripts

45

45