DALL-E does symbol grounding

p.b.

I don’t want to write much more than the title already expresses. But considering the constantly shifting goalposts when it comes to progress towards AGI, I think this bears writing down.

In the abstract of the seminal paper [1] on the topic, the symbol grounding problem is defined as follows:

„How can the semantic interpretation of a formal symbol system be made intrinsic to the system, rather than just parasitic on the meanings in our heads? How can the meanings of the meaningless symbol tokens, manipulated solely on the basis of their (arbitrary) shapes, be grounded in anything but other meaningless symbols?“

Good questions, and also valid objections to attributing intelligence or understanding to GPT-3. The abstract goes on to sketch a solution:

„Symbolic representations must be grounded bottom-up in nonsymbolic representations of two kinds: (1) iconic representations, which are analogs of the proximal sensory projections of distal objects and events, and (2) categorical representations, which are learned and innate feature-detectors that pick out the invariant features of object and event categories from their sensory projections. Elementary symbols are the names of these object and event categories, assigned on the basis of their (nonsymbolic) categorical representations. Higher-order (3) symbolic representations, grounded in these elementary symbols, consist of symbol strings describing category membership relations (e.g., An X is a Y that is Z).“

(1) In the case of DALL-E, the „iconic representations“ are pictures, which certainly qualify as „analogs of the proximal sensory projections of distal objects and events“.

(2) The „categorical representations“ are the text token embeddings that constitute DALL-E's input layer, and the „innate feature-detectors“ are provided by the VQ-VAE that decomposes images into image tokens.

(3) According to Harnad, higher-order symbolic representations can now be grounded in these grounded symbols. In practice, that would be a DALL-E scaled to the text context size of GPT-3 and trained on GPT-3's training data in addition to text-image pairs.
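
To make the mapping a bit more concrete, here is a toy sketch of how an image can be turned into discrete tokens by nearest-codebook lookup (the core idea of vector quantization) and then placed in the same sequence as the caption tokens. All names, sizes, and values are made up for illustration. A real VQ-VAE learns both its encoder and its codebook, and this is not DALL-E's actual code.

```python
# Toy sketch only: hypothetical names and sizes, not DALL-E's actual code.

def quantize(patch, codebook):
    """Return the index of the codebook vector closest to `patch`
    (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(patch, codebook[i]))


def build_sequence(text_tokens, patches, codebook, text_vocab_size):
    """Concatenate text tokens and image tokens into one token stream.

    Image token ids are offset by `text_vocab_size` so the two
    vocabularies do not collide in the shared sequence.
    """
    image_tokens = [text_vocab_size + quantize(p, codebook) for p in patches]
    return list(text_tokens) + image_tokens


# Hypothetical 2-D "patch features" and a 3-entry codebook.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
TEXT_VOCAB_SIZE = 100

text = [7, 42]  # made-up token ids standing in for a caption
patches = [(0.9, 0.1), (0.1, 0.8), (0.2, 0.1)]

sequence = build_sequence(text, patches, CODEBOOK, TEXT_VOCAB_SIZE)
print(sequence)  # [7, 42, 101, 102, 100]
```

A model trained to predict the next token in such a sequence has to relate the caption symbols to the image tokens that follow them, which is the sense in which the text symbols end up tied to something other than more text.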

[1] Harnad, Stevan (1990) The Symbol Grounding Problem.

http://cogprints.org/3106/