Do Transformers Reason about Electric Sheep?

Do not underestimate evolution

Over 3.5–4.5 billion years, evolution has developed in terrestrial carbon-based life the unconditioned reflexes and instincts sufficient for survival and propagation. Cumulatively, these and other types of behavior are called genetic memory. Homo sapiens, in this respect, is no different from other forms of life. However, something differentiates us favorably: the ability to distinguish, evaluate, and understand what is true and false, valuable and useless, moral and immoral. Instead of two extremes there may be a whole interval, but one thing is invariable: this ability to distinguish is called intelligence (a noun from the Latin verb intellego = inter, "between", + lego, "to (se)lect" = to distinguish, to sort out, which later acquired the meaning "to understand").

At what moment did the critical leap from the slowly adapting collective unconscious occur? When did the rapid growth of intelligence, not tied to genetic memory, begin?

The Dual Inheritance Theory and the ideas of Memetics suggest that the growth of intelligence was accompanied by an increase in the amount of knowledge accumulated in the population, which would not have been possible without effective mechanisms for the assimilation, transfer, and, more importantly, derivation of new knowledge. Our species overtook evolution with the help of collective knowledge, that is, culture. And it all began when one organism projected its knowledge onto another.

Another type of the dual development

Assimilation 

Knowledge is assimilated from the physical world, a system that is complex by nature. The protohuman, like other animals, assimilated primary knowledge about its environment in the form of conditioned (learned) reflexes. The variety of situations gives rise to a variety of knowledge: that objects fall under gravity, that contact with fire causes pain, that a newly discovered fruit is poisonous, judging by the body's reaction. Such knowledge is absorbed through the joint work of the sensory and central nervous systems and forms a world model. The activity that forms this model is motivated by universal reinforcement. One form of reinforcement is the intrinsic reward, which is regulated by dopamine neurotransmission in the dopaminergic pathways. In terms of Reinforcement Learning (RL), the degree of transmission can be described by the scalar R (reward).
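Below is a minimal sketch of this RL view: an agent acts in a toy environment and receives the scalar R. The environment, its states, and the reward values are purely illustrative assumptions, not a model of any real biological process.

```python
# Toy illustration of the scalar reward R described above.
import random

class ToyWorld:
    """Toy environment: some states are 'fire' (painful), some are 'fruit' (rewarding)."""
    def reset(self):
        self.state = random.choice(["fruit", "fire", "empty"])
        return self.state

    def step(self, action):
        # Scalar reward R: positive for eating fruit, negative for touching fire.
        if action == "approach" and self.state == "fruit":
            reward = +1.0
        elif action == "approach" and self.state == "fire":
            reward = -1.0
        else:
            reward = 0.0
        self.state = random.choice(["fruit", "fire", "empty"])
        return self.state, reward

env = ToyWorld()
state = env.reset()
for _ in range(5):
    action = random.choice(["approach", "avoid"])
    state, R = env.step(action)   # R is the scalar reward from the text
```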

Time will tell whether we need fully human-like artificial intelligence (strong AI) or just some of the human properties that lead to general intelligent activity (general AI). At some point, we may not have enough mutual understanding with an AGI. Indeed, for complete mutual understanding, nervous systems, unconditioned reflexes, and the other mechanisms given by evolution would have to be modeled. Moreover, the model would have to be computationally efficient yet accurately convey the activity of what it models. Evolution took an enormous amount of time by human standards to form these mechanisms. This suggests that full mutual understanding may not be achievable. On the other hand, it may not be necessary at all.

Due to lack of time, the learning and development of AI should occur in a simulation, and the complexity of the knowledge obtained by the AI will be proportional to the physical complexity of the simulation, which includes the complexity of the AI agents’ form (body) and the amount of sensory information they receive, the complexity of the simulated environment, the granularity of its discretization, the number of unique atomic blocks, etc.

Yes, we do

Transfer

Transfer of knowledge between protohumans was apparently carried out by both verbal and non-verbal means, for example, by particular sounds invested with meaning. At present, instead of a single sound, speech concepts are represented by sets of phonemes separated by pauses, that is, words. The transition from sounds to words can be explained by the fact that the number of concepts in speech increased while the number of sounds remained limited. Mathematically, a sequence of sounds or words can be represented by a waveform amplitude vector, a sound index vector, a token embedding matrix, or in any other way. Consequently, knowledge transfer can be generalized to the transmission of a tensor I (information).
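As a small illustration of the representations listed above, here is the same utterance encoded as a word index vector and as a token embedding matrix; the vocabulary and the embedding dimension are illustrative assumptions.

```python
import numpy as np

vocab = {"fruit": 0, "is": 1, "poisonous": 2}           # toy vocabulary
sentence = ["fruit", "is", "poisonous"]

token_indices = np.array([vocab[w] for w in sentence])  # sound/word index vector
embedding_table = np.random.randn(len(vocab), 8)        # 8-dimensional embeddings
I = embedding_table[token_indices]                      # token embedding matrix: the tensor I
print(token_indices.shape, I.shape)                     # (3,)  (3, 8)
```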

Let us now try to model the transmission of information by an individual. We will again turn to RL terms: an individual is called an agent, and the concepts of state and action are introduced, each described by a vector, S and A, at every moment of time. Consider two contrasting situations:

  1. Some birds make a characteristic sound when a predator approaches, which encourages their relatives to move away from the unsafe place. In RL terms, the bird that discovers the predator changes its S0, which causes the sound A0, which changes the states of its relatives Si, causing their reactions Ai. During these events, the emotional state of the birds changes (fear); roughly speaking, R decreases. A acts as I.
  2. The protohuman instinctively ate an attractive-looking fruit and discovered that it was poisonous. Suppose they know how to transmit such information to relatives. Then, in the process of transmitting the information, the agent places in I not only S (fruit) and A (eat) but also the evaluation R (bad). This is the main difference between the two situations, as well as the difference between agents that are productive and unproductive in terms of collective knowledge: the knowledge of productive agents has long-term value (see the sketch below).
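A tiny sketch of the contrast between the two situations: in the first, the transmitted I carries only state and action; in the second, it also carries the sender's evaluation R. The field names and values are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Message:
    S: str                      # state observed by the sender
    A: str                      # action taken or suggested
    R: Optional[float] = None   # evaluation; present only for "productive" agents

alarm_call = Message(S="predator nearby", A="fly away")    # situation 1: I = (S, A)
fruit_warning = Message(S="red fruit", A="eat", R=-1.0)    # situation 2: I = (S, A, R)
```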

— The dog can evaluate things

Let us consider in more detail the previous assumption that the agent can transfer knowledge. What made the protohuman transfer and assimilate knowledge? Evolutionarily, it enhances adaptation, and, just as evolutionarily, there is no need to do it consciously. At first, it could be an unconditioned reflex to reinforcement. It is not hard to imagine a chimpanzee making sounds of pleasure and passively projecting emotions through facial expressions when eating a delicious fruit. Through empathy, its relatives will perceive these emotions and remember the value of the fruit.

At this stage, we have a population of agents, each of which assimilates knowledge about the physical world and passes it on to others. However, learning in the RL paradigm differs from its biological counterpart in that there is usually only one agent, because the environment (simulation) can be returned to an arbitrary initial state after each launch (episode). That is, one agent can find itself in any situation, which removes the need for a large population of agents to cognize the environment. The second difference is the storing of information from different episodes in a shared memory, which eliminates the need for knowledge transfer, e.g. from older, more experienced generations to younger ones. Another, less obvious difference is the absence in RL of interspecies assimilation and of species as such.
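A minimal sketch of these two differences: the simulation is reset to an initial state at the start of every episode, and transitions from all episodes land in one shared memory (a replay buffer). The toy simulation and the buffer layout are illustrative assumptions.

```python
import random

class ToySim:
    """Trivial stand-in simulation; states and rewards are arbitrary."""
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):
        reward = random.choice([-1.0, 0.0, 1.0])
        self.state += 1
        return self.state, reward

replay_buffer = []                                # shared memory across all episodes

env = ToySim()
for episode in range(3):
    state = env.reset()                           # the simulation returns to an initial state
    for t in range(10):
        action = random.choice(["approach", "avoid"])
        next_state, reward = env.step(action)
        replay_buffer.append((state, action, reward, next_state))
        state = next_state

# A single agent can later learn from experience gathered in every episode,
# which replaces knowledge transfer between generations of agents.
batch = random.sample(replay_buffer, k=8)
```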

— Nobody knows what he told them

An example of interspecific assimilation would be one animal species borrowing the hunting methods of another. Moreover, each species has its own characteristics, which give rise to qualitatively different knowledge. One's own, intraspecific, and interspecific knowledge form a model of the world (learnable temporal representations), which potentially all animals with a nervous system have. At a minimum, Drosophila (~150,000 neurons) develop operant behavior. Thus, all intelligent life on Earth is one large mechanism for the assimilation and transfer of knowledge. AI can imitate this with a shared memory (collective consciousness).

— Unfortunately, we are not eternal. Source: xkcd. “A Bunch of Rocks”, part 1

The mechanisms of learning and transfer are sufficient for a group of monkeys to collectively learn how to use sticks and beat other relatives for their food, but not enough to learn how to fry food, because frying involves deriving knowledge from events dispersed in time. The probability that, in the course of its activity, a monkey will accidentally start a fire and, moreover, that dead game will happen to be next to the fire is extremely small, so the knowledge that game can be cooked is complex. It is formed by several trajectories of actions and states, some of which, at first glance, have no value. How is this knowledge combined?

Source: xkcd. “A Bunch of Rocks”, part 2

Derivation

"Derivation" is used instead of the more general concept of “inference” because the distinguishing ability of a human is precisely the quality of derivation, which by definition is deductive inference.

“High level inference patterns called knowledge transmutations. Among basic transmutations are generalization, abstraction, similization, generation, insertion and replication.”

“Learning = Inferencing + Memorizing”

— Inferential Theory of Learning

Since we ideally want to achieve human-like reasoning, we need to consider and implement all categories of inference. In addition to deductive inference, inductive, abductive, and transductive inference are often mentioned.

Disclaimer: boring part without memes, but very important

Induction 

Induction means the approach of “leading (from the particular) to (the general)” (5.b, Top. 105a13). It had this meaning in ancient Greece (epagoge); now we call it generalization. One way to generalize is averaging, and one type of averaging is the regression function f(x) = E[Y | X = x]. That is why many problems where induction is used are called regression. Types of induction are simple generalization and analogical, causal, and sign reasoning*.

* Symbolic reasoning, ambiguously also called sign reasoning
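As a small, hedged illustration of induction as averaging, the regression function f(x) = E[Y | X = x] can be estimated from a finite sample simply by grouping and averaging; the data below is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=1000)             # a discrete "particular" input
Y = 2.0 * X + rng.normal(0.0, 0.5, size=1000)

def regression_function(x):
    # generalize the particulars with X == x into one "general" value
    return Y[X == x].mean()

print([round(regression_function(x), 2) for x in range(3)])  # ≈ [0.0, 2.0, 4.0]
```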

Induction does not imply an order of the premises in an inference; it is associative. Syntactically, induction is addition:

x1 → y1
x2 → y1
x3 → y2
x3 → y3
+++++++++++
{x1, x2} → {y1}ㅤ~ㅤX1 → Y1
{x3} → {y2, y3}ㅤ~ㅤX2 → Y2

(→ denotes implication)

Artificial neural networks perform simple generalization by their nature.


Deduction

Deduction has the meaning of “deducing, subtracting” the common parts of the premises. It implies an order of the premises in the inference, and order begets a sequence.

The topological sorting (order) of premises forms a sequence

Syntactically, deduction is subtraction:

A → B, B → C
— — — — — —
A → C

A → B, B → C, C → D
— — — — — — — — — —
A → D

Artificial neural networks in the ideal case implement the deductive Modus ponens, “{X, X → Y} ⊨ Y”: taking an input X, they output Y. The causal attention mechanism of a generative transformer additionally implements another type of deduction, the syllogism:

{A → B, AB → C} ⊨ A_ ~> Cㅤ⇐ ⇒?ㅤ{A → B, B → C} ⊢ A → C

(~> denotes weak implication, ⊨ denotes semantic consequence, ⊢ denotes syntactic consequence)
(⇐ soundness of the generative transformer: we can learn any semantic rule because neural networks are universal approximators;
 ⇒? completeness of the generative transformer: later in the article I will answer whether the semantic rule can approach strict implication → via optimization of weak implication ~>)

Causal attention takes token A to predict token B, and tokens A, B to predict token C. Now token A contributes to the output of token C to some extent, so that if we pass token A followed by an empty token, we will probably get token C as the output.
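Here is a hedged sketch of causal attention in this spirit: with a lower-triangular mask, the representation at each position mixes in all earlier tokens, so token A contributes to the context that predicts C. The dimensions and the single attention head are illustrative simplifications of a real transformer.

```python
import torch
import torch.nn.functional as F

T, d = 3, 16                                  # sequence "A B C", model width 16
x = torch.randn(T, d)                         # token embeddings for A, B, C
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = (q @ k.T) / d ** 0.5                 # (T, T) attention scores
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~causal_mask, float("-inf"))
attn = F.softmax(scores, dim=-1)              # row i attends only to positions <= i

out = attn @ v
# out[1] (the context predicting B) sees A; out[2] (the context predicting C) sees A and B,
# which is what lets the weak implication A ~> C be learned.
```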


Abduction

Abduction got its name from Charles Peirce’s misunderstanding of Aristotle’s reduction and from Giulio Pachio’s too-literal translation of it into Latin. By reduction (apagoge), Aristotle meant the decomposition of a complex argument into simple premises and the subsequent deductive inference. Peirce, relying on the incorrect interpretation, paraphrased it as follows:

{All men are mortal, Socrates is mortal} ⊢ Socrates is a man

~

Socrates → Mortal, Man → Mortal
— — — — — — — — — — — — — — — — —
Socrates → Man

~

A → B, C → B
— — — — — —
A → C

At first glance at the premises, induction would seem suitable here, since the premises share the predicate “B”. In fact, a deduction is possible. To explain the essence of this argument, we need a little trick. Logic usually uses implication (→), but in life events often have a bidirectional relationship: fire is related to smoke, and smoke is related to fire. Such a relationship can be described by weak equivalence (<~>): smoke <~> fire. Obviously, smoke does not always follow fire, but it does quite often. Then, for a better visual understanding of “subtracting B from B”, the second premise “C <~> B” can be permuted as “B <~> C”. Now the argument about the mortality of Socrates looks almost like a deductive one:

A → B, B <~> C
— — — — — — —
A ~> C

In the example with three premises, there will already be two permutations (made exclusively for better visual understanding: B-B, C-C):

B <~> A, B <~> C, D <~> C
~
A <~> B, B <~> C, C <~> D
— — — — — — — — — — —
A <~> D

This is not a common opinion, but all premises of deduction are formed inductively. In the premise “Socrates → Mortal”, the concept “Socrates” is formed inductively from images of Socrates: “Image_Socrates_i → Concept_Socrates”; in the same way, “Image_Death_i + Concept_Finiteness → Concept_Mortal”. On this basis, the abductive argument that Peirce used as an example is 1) the more powerful premises generated by bidirectional induction, plus 2) a type of deduction capable of working with bidirectional premises. When defining abduction, one usually refers to the first aspect (1), that is, to the bidirectional inductive nature, but this does not make abduction a separate category of inference; rather, it is a type of induction. It is convenient to think that each type of induction corresponds to its own type of deduction. Based on (1) and (2), we can speak of abductive induction and abductive deduction.

Is a generative transformer (e.g. GPT) capable of learning bidirectional facts? No [1, 2], but a modeling transformer (e.g. BERT MLM) is.


Transduction

Transduction has been known in Eastern Europe since the 19th century under the word “traduction”, a variant of the same term. Into the English-language literature it was introduced by Vladimir Vapnik. Statistical Learning Theory, page 339: “In contrast to the inductive inference where one uses given empirical data to find the approximation of a functional dependency (the inductive step) and then uses the obtained approximation to evaluate the values of a function at the points of interest (the deductive step), we will try to estimate the values of a function at the points of interest in one step”.

In terms of machine learning, this is few-shot inference (a few demonstrations, no intermediate training), i.e. hypernetworks:
params = f(xy_unlabeled), g_params(xy_query) = labels_query. We take the unlabeled characterizing coordinates xy_unlabeled (the “?” points) and use a function f to generate the parameters of a function g. During training, xy_query := xy_labeled and labels_query := known_labels (where known_labels are A, B, C). Perhaps, if g has few parameters, transductive inference is simpler than inductive inference. But empirically, training neural networks (many parameters) in this setting is slow and difficult.
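A minimal hypernetwork sketch under this notation: f consumes the unlabeled points and emits the parameters of g, which then maps query points to labels. The architecture sizes, the set pooling, and the label count are illustrative assumptions.

```python
import torch
import torch.nn as nn

in_dim, hid, out_dim = 2, 16, 3               # xy coordinates -> 3 labels (A, B, C)
n_params = hid * in_dim + hid + out_dim * hid + out_dim

f = nn.Sequential(                            # params = f(xy_unlabeled)
    nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, n_params)
)

def g(params, xy_query):                      # g_params(xy_query) = labels_query
    i = 0
    W1 = params[i:i + hid * in_dim].view(hid, in_dim); i += hid * in_dim
    b1 = params[i:i + hid]; i += hid
    W2 = params[i:i + out_dim * hid].view(out_dim, hid); i += out_dim * hid
    b2 = params[i:i + out_dim]
    return torch.relu(xy_query @ W1.T + b1) @ W2.T + b2

xy_unlabeled = torch.randn(10, in_dim)        # the "?" points
params = f(xy_unlabeled).mean(dim=0)          # pool over the set into one parameter vector
logits = g(params, torch.randn(5, in_dim))    # labels for 5 query points
```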

Despite the ambition of the definition of transduction to work in one step, it actually has two steps:
1) Deduction of f and g (Modus ponens)
2) Induction of f and g:
(xy_unlabeled_i, xy_labeled_j) → known_labels_j
(xy_unlabeled_i+1, xy_labeled_j+1) → known_labels_j+1
...
(xy_unlabeled_i+k, xy_labeled_j+p) → known_labels_j+p

This suggests that “one-step” execution is not the defining feature. Vapnik’s transduction is defined by the fact that it learns in function space rather than in vector space.

Wikipedia lists agglomerative clustering and partition-based clustering as examples of transduction. In fact, according to their descriptions, these are still the same two steps: induction (generalization) and deduction (propagation). But instead of generalizing over all possible points in space, we use only a sample. For example, in the first step of semi-supervised agglomerative clustering, which uses transduction, we calculate the distances between all pairs of sample points. Then we merge the clusters and propagate the labels. That is, during the induction we generalize a finite sample by the distance attribute, rather than all points in space by the label attribute. All clustering methods with label propagation fit into these two steps. Moreover, induction and deduction can alternate: for example, in partition-based clustering, we split the space (~generalize the points within the split parts), look for conflicts, and propagate the known labels if there are no conflicts.
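A short sketch of these two steps for semi-supervised agglomerative clustering, using SciPy: induction generalizes the sample by pairwise distances, and deduction propagates the few known labels within the resulting clusters. The data, the number of clusters, and the labeled indices are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
known = {0: "A", 25: "B"}                      # only two labeled points

# Induction: generalize the finite sample by the distance attribute.
Z = linkage(pdist(points), method="average")
cluster_id = fcluster(Z, t=2, criterion="maxclust")

# Deduction: propagate each known label to every point in the same cluster.
labels = {}
for idx, lab in known.items():
    for j in np.where(cluster_id == cluster_id[idx])[0]:
        labels[int(j)] = lab
```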

The metalearning schema. Source: Cloudera Fast Forward Labs

In addition to hypernetworks, there is few-shot in-context transductive inference (several demonstrations, with intermediate training): metalearning. It differs from hypernetworks in that we do not derive the parameters of g from f but copy them from f and train them up into g. GPT-3 successfully leverages metalearning in the pre-training phase.
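A hedged sketch of this contrast: instead of generating the parameters of g, we copy them from a (stub) pretrained f and adapt them with a few gradient steps on the demonstrations, i.e. the "intermediate training". The model, loss, and step count are illustrative assumptions, not any specific published metalearning algorithm.

```python
import copy
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 3))   # pretrained model (stub)
g = copy.deepcopy(f)                                                # copy the parameters from f

demos_x, demos_y = torch.randn(5, 2), torch.randint(0, 3, (5,))     # a few demonstrations
opt = torch.optim.SGD(g.parameters(), lr=1e-2)
for _ in range(10):                                                  # train them up into g
    opt.zero_grad()
    loss = nn.functional.cross_entropy(g(demos_x), demos_y)
    loss.backward()
    opt.step()

query_logits = g(torch.randn(4, 2))                                  # transductive answer for queries
```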

What do hypernetworks, clustering, and metalearning have in common? They all use the bidirectionality of premises, namely strict equivalence (↔). That is, whereas in the abduction example smoke does not always follow fire, in the case of transduction, if John is Bob’s brother, then Bob is always John’s brother. This property forms equivalence classes (clusters). It enables us to propagate labels in agglomerative clustering and underlies hypernetworks and metalearning, which at an intermediate stage form a function g representing these clusters.

We need both transformers to cover all types of inference

Is it possible to learn strict equivalence in a transformer? About as much as it is possible to learn the idea of a perfect circle: it becomes possible as understanding grows, that is, with the quantity of forms and ideas and the quality of schemas.

Transduction in hypernetworks and metalearning occurs implicitly somewhere in the network parameters and requires slow learning. But we can amplify existing transformers with explicit transduction. For example, in the Memorizing Transformers and RETRO articles, the authors use the k-nearest neighbors algorithm (k-NN), which is a kind of clustering.
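A heavily simplified sketch of explicit transduction via k-NN retrieval: cached (key, value) pairs act as an external memory, and each query mixes in its nearest neighbors. This only illustrates the idea and is not the actual mechanism of Memorizing Transformers or RETRO.

```python
import torch

d, k = 16, 4
memory_keys = torch.randn(1000, d)            # cached keys from earlier context
memory_vals = torch.randn(1000, d)            # cached values

def knn_lookup(query):                        # query: (d,)
    sims = memory_keys @ query                # similarity to every memory slot
    idx = sims.topk(k).indices                # indices of the k nearest neighbors
    weights = torch.softmax(sims[idx], dim=0)
    return weights @ memory_vals[idx]         # weighted mixture of retrieved values

query = torch.randn(d)
retrieved = knn_lookup(query)                 # can be combined with the local attention output
```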

The thesis

To summarize, we have shown what RL has and does not have compared to its biological analogs. Having established that modern RL successfully imitates the assimilation and transfer of knowledge, we considered knowledge derivation, or, broadly speaking, inference. There are only two superordinate categories of inference, induction and deduction; abduction and transduction are just their subcategories. We then checked whether modern architectures, in particular transformers, implement them. Transformers must implement at least these (sub)categories to be true reasoners, and we have shown that this necessary condition is met.
