Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention by Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal of Google.

This is a preprint describing a new LLM extension that adds what I'd call short-term memory, and what the authors call Infini-attention. It was published just yesterday (2024-04-10), and I came across it in XiXiDu's daily AI summary. I think it may mark a turning point for a certain type of LLM capability, and I want to comment on it.

Here is the abstract: 

This work introduces an efficient method to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation. A key component in our proposed approach is a new attention technique dubbed Infini-attention. The Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds in both masked local attention and long-term linear attention mechanisms in a single Transformer block. We demonstrate the effectiveness of our approach on long-context language modeling benchmarks, 1M sequence length passkey context block retrieval and 500K length book summarization tasks with 1B and 8B LLMs. Our approach introduces minimal bounded memory parameters and enables fast streaming inference for LLMs.

I have now read most of the preprint, and this is the closest thing to a model with something like short-term memory that I have seen (there are others, but the paper says this is the first LLM that doesn't fully discard looked-up memory after a step). With its incremental update rule, the model can keep track of a topic over time (a 1M-token context was tested, but there is no upper limit in principle). It learns to use these transient representations efficiently, and thereby learns which things to "keep in mind".
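To make the incremental update concrete, here is a minimal sketch of the memory as I understand it from the paper: a linear-attention-style associative matrix that is added to after each segment and queried with a normalized read. The names (`CompressiveMemory`, `update`, `retrieve`) are mine, not the paper's, and the sketch makes several simplifying assumptions: a single head, plain Python lists, no learned projections, and none of the gating that mixes the memory read with local attention. Treat it as an illustration of why the state stays bounded, not as the authors' implementation.

```python
import math


def elu_plus_one(x):
    """Nonlinearity sigma(x) = ELU(x) + 1, which is strictly positive."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


class CompressiveMemory:
    """Fixed-size memory: a d_key x d_value matrix M plus a normalizer z.

    Hypothetical single-head sketch; the size never depends on how many
    tokens have been written into it.
    """

    def __init__(self, d_key, d_value):
        self.M = [[0.0] * d_value for _ in range(d_key)]
        self.z = [0.0] * d_key

    def update(self, keys, values):
        """Fold one segment into memory: M += sigma(K)^T V, z += sum sigma(K)."""
        for k, v in zip(keys, values):
            phi = elu_plus_one(k)
            for i, p in enumerate(phi):
                self.z[i] += p
                row = self.M[i]
                for j, vj in enumerate(v):
                    row[j] += p * vj

    def retrieve(self, query):
        """Read from memory: sigma(q) M / (sigma(q) . z)."""
        phi = elu_plus_one(query)
        d_value = len(self.M[0])
        denom = sum(p * zi for p, zi in zip(phi, self.z))
        if denom == 0.0:  # nothing written yet
            return [0.0] * d_value
        return [
            sum(p * self.M[i][j] for i, p in enumerate(phi)) / denom
            for j in range(d_value)
        ]


# Stream several toy "segments" (2 tokens each here, 2k in the paper's tests).
memory = CompressiveMemory(d_key=3, d_value=2)
segment_keys = [[0.5, -1.0, 2.0], [-0.3, 0.1, 0.0]]
segment_values = [[1.0, -2.0], [1.0, -2.0]]
for _ in range(5):
    memory.update(segment_keys, segment_values)

# The state is still 3 x 2 numbers plus a length-3 normalizer.
print(len(memory.M), len(memory.M[0]), len(memory.z))
print(memory.retrieve([0.2, -0.7, 1.5]))
```

The read is a weighted average of everything written so far, so the memory cost stays constant however long the input gets. What the model has to learn is which keys and values to write so that later queries pull out something useful.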

The paper:

An effective memory system is crucial not just for comprehending long contexts with LLMs, but also for reasoning, planning, continual adaptation for fresh knowledge, and even for learning how to learn.

They call it Infini-attention because of the incremental update, but it is really a continuous short-term memory system. Of course, it works very differently from short-term memory in biological brains: for example, it seems to operate at multiple layers in parallel and at a fixed granularity of tokens (2k in the tests). Still, given the way it works, I think it's conceivable that such a model could show conscious-like effects even if it was not trained on text that contains them. In particular, because it learns to use the memory, it might notice the pattern of how it functions, at least if trained on material where such patterns are relevant.

I am not saying that such an LLM is or could be conscious; such terms are not well-defined enough to support that judgment. And anyway, I don't think moral weight is tied directly to consciousness (there are too many counterexamples). But I do think models based on such an architecture may produce more authentic conscious-like responses in dialogs (in the sense of reporting on inner experiences).
