TL;DR: vLLM-Lens is a vLLM plugin for top-down interpretability techniques[1] such as probes, steering, and activation oracles. We benchmarked it as 8–44× faster than existing alternatives for single-GPU use, though we note a planned version of nnsight closes this gap. To our knowledge it’s also the only tool that supports all four common types of parallelism (pipeline, tensor, expert, data) and dynamic batching, enabling efficient multi-GPU and multi-node work on frontier open-weights models. It is also integrated with Inspect. The main trade-off, compared to other tools such as nnsight and TransformerLens, is that it’s less flexible out-of-the-box. It is however very small and extensible - it could likely be adapted to your use case, and we have a Garcon-style interface in the works.
We are releasing it under an MIT license here: https://github.com/UKGovernmentBEIS/vllm-lens.
Problems it Addresses
Large-model support. Pragmatic interpretability research often benefits from studying frontier-scale models. For example, Read et al. (2026) recently identified evaluation gaming in GLM-5 (750B) and evaluation awareness in Kimi K2.5 (1T), but did not find the same phenomena in smaller models. We found that other tools didn’t support these larger models, didn’t support multi-node inference, and/or were prohibitively slow to run.
Speed. Beyond enabling research on larger models, we also wanted to have faster iteration loops with smaller models.
Interleaving black-box and white-box techniques. It’s often helpful to study black-box and white-box techniques concurrently - for example by running probes and activation oracles alongside black-box interrogation in automated alignment audits. A common previous workflow was to generate rollouts with vLLM and then switch to HF Transformers for white-box work - a process that was slow and inflexible.
Writing distributed PyTorch code to solve these problems quickly adds complexity to research codebases, so we wanted to abstract that complexity away.
Functionality
vLLM-Lens offers high performance, supporting tensor, expert, pipeline and data parallelism (across GPUs and nodes), as well as dynamic batching. You can also use multiple interpretability techniques concurrently, in the same dynamic batch. Finally, it includes an Inspect model provider, supporting techniques such as an “activation oracle solver” in Petri or coup probes in ControlArena. An illustrative Inspect lie-detection scorer is shown below[2], and you can see an activation oracle example here.
@scorer(metrics=[])
def deception_probe(probe: Module, layer: int = 21) -> Scorer:
    async def score(state: TaskState, target: Target) -> Score:
        model = get_model()
        output = await model.generate(
            state.messages,
            config=GenerateConfig(
                max_tokens=1,
                extra_body={"extra_args": {"output_residual_stream": [layer]}},
            ),
        )
        acts = output.metadata["activations"]["residual_stream"][0]
        scores = probe(acts)
        return Score(value=scores.mean().item())

    return score
Comparisons with Other Tooling
To our knowledge, the closest alternative is the vLLM version of nnsight, which lacks features such as support for pipeline parallelism and the latest models[3]. We also found the intervention graph approach challenging to debug. We note, however, that tensor parallelism support was recently added, and that further improvements significantly increasing performance are in the works.
Other approaches include using HF Transformers and hooks directly, or Transformers-based tooling such as TransformerLens, standard nnsight or nnterp. These approaches suffer from HF Transformers being on the order of 10× slower than vLLM and less memory-efficient. They also require more performance tuning than vLLM - e.g., setting the batch size manually.
Single-GPU Performance
To estimate the single-GPU performance differential versus other libraries, we generate 1000 completions from prompts in the Alpaca dataset with Facebook OPT-30B, extracting activations from all tokens at a single layer of the residual stream. We use default settings for all libraries, attempt to follow their documentation where available, and optimize batch sizes to prevent out-of-memory errors[4], where necessary. We find vLLM-Lens to be 8.1× faster than native HF Transformers, 10.6× faster than the current nnsight vLLM version[5] (0.6.3) and 44.8× faster than TransformerLens for this task. vLLM-Lens was ~20% slower than pure vLLM (with no activation extraction). We note that a new nnsight vLLM version being developed is substantially faster, bringing it broadly in line with vLLM-Lens for single-node use.
We note that benchmarking of all tooling was done on the Isambard cluster, which may bias results, as we optimised vLLM-Lens for performance using the same cluster. In addition, nnsight’s remote execution capabilities were not benchmarked here. Conversely, we anticipate that this may substantially underestimate performance benefits for realistic auditing scenarios, as vLLM-Lens excels in scenarios where you apply different operations (e.g., steering, probes and black-box interrogation) to different samples, in the same dynamic batch.
Multi-Node Performance
For an indication of multi-node performance, we report vLLM-Lens timings on a variety of models below, on a task that involves evaluating 3 different lie-detection probes on the Roleplaying dataset (371 samples), using a cluster of nodes with 4×H100 GPUs each. We were unable to benchmark nnsight vLLM on multi-node setups due to out-of-memory issues with small models and moderate sample sizes (>100).
Model         | Parameters (B) | Nodes | PP | TP | Time to run the full evaluation (mins)
Gemma 3 27B   | 27             | 1     | 1  | 2  | 1:58
GPT OSS 120B  | 120            | 1     | 1  | 4  | 1:56
DeepSeek V3.2 | 671            | 4     | 4  | 4  | 3:22
GLM 5 (FP8)   | 745            | 5     | 5  | 4  | 5:43
Kimi-K2.5     | 1000           | 4     | 4  | 4  | 4:26
Limitations
An important downside of vLLM-Lens is that it provides a relatively small subset of all possible top-down interpretability techniques, currently focussing exclusively on interaction with the residual stream. We’ll extend features as we find more use cases, and we’ve found coding agents can relatively easily add additional hooks. If you’re working with large models and/or need faster inference and feedback cycles, it may well be useful for you; for other use cases you may find nnsight or TransformerLens to be a better fit.
Technical Approach
The vLLM plugin system isn’t well documented and we found that coding agents struggle to reason about vLLM internals, so we provide a brief overview of the technical approach here. vLLM-Lens registers as a vLLM plugin and injects itself into vLLM's processing pipeline in 3 locations:
Intercepting generate calls. To utilise the plugin, you can pass extra args such as output_residual_stream or apply_steering_vectors in the sampling parameters. The plugin extracts these, initialises relevant PyTorch hooks if they're not already set up (by adding a worker extension) and sends steering vectors directly to workers (vLLM typically has one worker per GPU).
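As a minimal illustration of this interception step, the sketch below separates lens-specific extra args from any other request arguments before they are forwarded to workers. The names `LENS_KEYS` and `split_lens_args` are illustrative only, not the plugin's actual internals; only the argument names `output_residual_stream` and `apply_steering_vectors` come from the text above.

```python
# Hypothetical sketch: route lens-specific extra args away from the rest of
# the sampling parameters, as the plugin must do when it intercepts generate.
LENS_KEYS = {"output_residual_stream", "apply_steering_vectors"}

def split_lens_args(extra_args: dict) -> tuple[dict, dict]:
    """Separate vLLM-Lens args from any other extra args on a request."""
    lens = {k: v for k, v in extra_args.items() if k in LENS_KEYS}
    rest = {k: v for k, v in extra_args.items() if k not in LENS_KEYS}
    return lens, rest

lens, rest = split_lens_args(
    {"output_residual_stream": [21], "some_other_arg": "unused"}
)
# lens -> {"output_residual_stream": [21]}
```

In the real plugin the lens args would then initialise hooks and be shipped to the per-GPU workers; the split itself is the only part sketched here.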
Per-sample hook operations. vLLM dynamically batches tokens from multiple concurrent requests into a single forward pass, so a core challenge is "book-keeping" - working out which operations (e.g., activation extraction) should be applied to which parts of the forward pass. To do this we read the forward_context metadata, utilising the query_start_loc (a tensor of token boundaries per request) and req_ids (mapping batch index to request ID). We then, for example, apply steering to just the slices that correspond to the sample that requested it. Any extracted activations are moved to CPU RAM and compressed (lossless), ready to be requested by the vLLM scheduler process once generation for that sample has completed. Steering runs on all tensor-parallel ranks (since it modifies the forward pass), but capture operations only run on TP rank 0 (the residual streams are identical across TP replicas after all-reduce).
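The book-keeping above can be sketched with plain Python lists standing in for tensors. `query_start_loc` holds cumulative token boundaries and `req_ids` maps batch position to request ID, mirroring the forward-context metadata described in the text; the function name and steering-as-scalar-offset are illustrative simplifications, not the plugin's API.

```python
# Illustrative book-keeping sketch: apply a per-request steering offset to
# just that request's slice of a dynamically batched, flattened token stream.

def apply_per_request_ops(hidden, query_start_loc, req_ids, steering):
    """Modify only the token slices belonging to requests that asked for it."""
    out = list(hidden)
    for i, req_id in enumerate(req_ids):
        start, end = query_start_loc[i], query_start_loc[i + 1]
        if req_id in steering:
            offset = steering[req_id]  # stand-in for a steering vector
            for t in range(start, end):
                out[t] = out[t] + offset
    return out

# Two requests dynamically batched into one flat token sequence of 5 tokens:
hidden = [0.0, 0.0, 0.0, 0.0, 0.0]
query_start_loc = [0, 2, 5]   # request A owns tokens 0-1, request B tokens 2-4
req_ids = ["A", "B"]
steered = apply_per_request_ops(hidden, query_start_loc, req_ids, {"B": 1.0})
# steered -> [0.0, 0.0, 1.0, 1.0, 1.0]: only request B's slice was modified
```

The real implementation does the same slicing on GPU tensors inside a PyTorch hook, with actual steering vectors rather than scalar offsets.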
Response collation. The plugin intercepts the response before it is sent to the client, at which point it queries the relevant vLLM processes for any requested activations. It trims surplus activations, much as vLLM does under the hood with tokens (the scheduler often runs ahead of the number of tokens it actually needs to generate before stopping). Activations are then returned to the client.
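The trimming step amounts to cutting the captured per-token activations back to the length of the final output. A minimal sketch, with an illustrative function name (`trim_activations` is not the plugin's API):

```python
# Hedged sketch of response collation's trimming step: the scheduler may run
# the forward pass for more tokens than survive in the final response, so
# captured activations are cut back to the tokens actually returned.

def trim_activations(per_token_acts, n_output_tokens):
    """Drop activations captured beyond the final generated length."""
    return per_token_acts[:n_output_tokens]

acts = [[0.1], [0.2], [0.3], [0.4]]  # captured for 4 scheduled tokens
trimmed = trim_activations(acts, 3)  # only 3 tokens made it into the response
# trimmed -> [[0.1], [0.2], [0.3]]
```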
Credits
Thanks to Satvik Golechha for the original idea of doing this with vLLM, and the nnsight team for inspiration. Thanks to Walter Laurito and Geoffrey Irving for valuable feedback.
Defined as attempting to locate or alter information in a model without full understanding of how it is processed.
In practice it’s more typical to run probes on a subset of generated tokens, but the scorer here runs on all tokens for simplicity.
At the time of writing, it supports vLLM 15.1 only.
vLLM automatically determines an appropriate dynamic batch size during execution (a behaviour inherited by vLLM-Lens). For the Hugging Face Transformers, nnsight (transformers version) and TransformerLens libraries, we instead perform a simple search procedure: beginning at a batch size of 512 and iteratively halving until the run completes without GPU out-of-memory errors, after which we report the runtime of the largest successful configuration. For nnsight (vLLM backend), dynamic batching follows vLLM’s default behaviour and does not trigger GPU memory issues; however, CPU memory limits can still be encountered, which we resolved by manually calculating the most efficient batch size.
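The halving search described above can be sketched as follows. `run_benchmark` is a stand-in for actually timing a library at a given batch size; here a fake version just raises above a memory limit, since the real runs depend on GPU hardware.

```python
# Sketch of the batch-size search used for the non-vLLM baselines: start at
# 512 and halve until a run completes without an out-of-memory error.

def find_largest_batch(run_benchmark, start=512):
    """Halve the batch size until a run succeeds, then return that size."""
    batch = start
    while batch >= 1:
        try:
            run_benchmark(batch)
            return batch
        except MemoryError:
            batch //= 2
    raise RuntimeError("no batch size fits in memory")

def fake_run(batch):  # pretend anything above 64 runs out of GPU memory
    if batch > 64:
        raise MemoryError

largest = find_largest_batch(fake_run)
# largest -> 64 (512, 256 and 128 all "OOM"; 64 is the first success)
```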
We think this was likely mostly due to the issues addressed by https://github.com/ndif-team/nnsight/pull/652, which forced us to enable batching to avoid out-of-memory errors. A provisional experiment with the version of nnsight from that PR found performance to be the same as vLLM-Lens in a single-GPU test, but nnsight was 1.9× slower in a 4-GPU test (TP=4).