Should we kill recurrent memory in favor of self-attention ❓

Spending most of my time on time series problems, I often think about the consequences of memory and the sequential nature of what we are exposed to in the physical world.

Memory is the ability of a learning algorithm to store a representation of a system's state over time. Think of how much you remember from what you learned a week ago. The past is only partially observable today: in our conscious experience, large parts of it naturally tend to fade.

I'm going to discuss two main differentiable programming (deep learning) paradigms: Recurrent Neural Networks and Transformers. Then I'll lay out the premise for my question.

Recurrent differentiable programs (RNNs) enable memory to be transferred across sequential states. Imagine the beat of your heart being processed at each second, with only a specific part of the pattern being remembered. This memory is useful for predicting the next state of a system. The mechanism is written into the program itself, and it gives rise not only to memory but also to the idea of learning how to remember.
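To make the recurrent idea concrete, here is a minimal sketch of a single-unit recurrent update in plain Python. The weights `w_h`, `w_x`, and `b` are fixed, made-up values for illustration; in a real RNN they would be learned, and the state would be a vector rather than one number. The `heartbeat` list is also a hypothetical toy signal.

```python
import math

def rnn_step(h_prev, x_t, w_h=0.5, w_x=1.0, b=0.0):
    """One recurrent update: blend the previous memory with the new input."""
    return math.tanh(w_h * h_prev + w_x * x_t + b)

def run_rnn(signal, h0=0.0):
    """Process the sequence one step at a time, carrying the state forward."""
    h = h0
    states = []
    for x_t in signal:
        h = rnn_step(h, x_t)
        states.append(h)
    return states

# Hypothetical toy "heartbeat": a spike followed by quiet samples.
heartbeat = [0.0, 1.0, 0.2, 0.0, 0.0, 1.0, 0.2, 0.0]
states = run_rnn(heartbeat)
```

Notice that the state stays non-zero on the quiet samples right after a spike: that leftover value is the memory, and how quickly it fades depends on `w_h`.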

Transformers (used in GPT) introduce a mechanism called self-attention to act as memory. Its advantage is the ability to process months of your heartbeat signal in one parallel swoop, as well as to capture global dependencies across a sequence. But it poses its own design challenges. Self-attention on its own does not track the order of its inputs, so we usually need to preserve the sequential order of our data explicitly, and we may want even more control over how the algorithm reads from and writes to memory to solve a problem.
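For comparison, here is a rough sketch of scaled dot-product self-attention in plain Python. It is simplified on purpose: I'm assuming queries, keys, and values are all the raw inputs (no learned projections, no multiple heads, no positional encoding), and the `seq` values are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    """seq is a list of equal-length vectors; Q = K = V = seq (assumption)."""
    d = len(seq[0])
    outputs, weights = [], []
    for q in seq:
        # Every position scores itself against every position in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        w = softmax(scores)
        weights.append(w)
        # The output is a weighted average of all value vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, seq)) for i in range(d)])
    return outputs, weights

seq = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.1]]  # hypothetical toy inputs
outputs, weights = self_attention(seq)
```

Each position's loop iteration is independent of the others, which is why the whole thing can run in parallel. Reversing `seq` simply reverses `outputs`, which shows why order information has to be supplied separately.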

This poses two schools of thought:
1️⃣ Explicit memory register
2️⃣ Self-attention mechanism

3 more quick priors before my hypothetical question.
➡ Differentiable programs encode data into representations that are easier to learn from, then decode them to make predictions.
➡ The Universal Transformer is an architecture that enables the use of recurrence in either the encoding or decoding phase to provide more memory control.
➡ Neural Turing Machines have an external memory register, essentially to design algorithms with memory usage of arbitrary complexity. Though this area of research has gone quiet lately, the ideas still loom.
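To give a feel for what an explicit memory register looks like, here is a small sketch of a content-based read, the kind of lookup a Neural Turing Machine uses to address its memory. It covers only the read side: I'm leaving out location-based shifting, write heads, and the controller network, and `memory`, `key`, and `beta` are made-up values for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity, with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def content_read(memory, key, beta=5.0):
    """Soft lookup: weight each memory row by its similarity to the key."""
    sims = [beta * cosine(row, key) for row in memory]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    width = len(memory[0])
    read = [sum(w * row[i] for w, row in zip(weights, memory)) for i in range(width)]
    return read, weights

memory = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # hypothetical memory slots
read_vec, weights = content_read(memory, key=[1.0, 0.1])
```

A larger `beta` sharpens the lookup toward the single best-matching slot, while a smaller one blends several slots together. Because the read is a smooth weighted sum, it stays differentiable and can be trained with the rest of the program.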

These are questions I would love more insight on: 

❓ In a world where Transformers can also exhibit universal properties with granular memory control, what is the case for sequential or recurrent memory?

❓ If there is still a case for recurrent memory, are Transformers doing to memory what Differentiable Programming did to Deep Learning? In other words, is this abstraction of memory adding complexity?

