Or, Why Conceptors Are So Damn Cool.
This post is part of my hypothesis subspace sequence, a living collection of proposals I'm exploring at Refine. Preceded by an exploration of (structural) similarity in the same context of coupled optimizers.
Thanks Alexander Oldenziel, Herbert Jaeger, and Paul Colognese for discussions which inspired this post.
Conceptors, as introduced by Herbert Jaeger in 2014, are an extremely underrated formalism at the intersection of linear algebra, Boolean logic, and dynamical systems, with a host of direct applications in prosaic alignment. For instance, conceptors can help (1) identify ontologies from latent activations, (2) steer sequence models by whitelisting or blacklisting dynamics, (3) shield internal representations under distribution shift, and many others. Their lineage can be traced all the way back to Hopfield networks, and they remain more mathematically-grounded than many ML-adjacent formalisms.
This post is mostly an overview of conceptors, the primitives they expose, and their low-hanging fruit applications in prosaic alignment. Accordingly, most of the ideas presented here can be found in the linked papers, one of which I was excited to co-author. Besides offering a walkthrough of those existing ideas, I'll also explore how they might apply at different levels of the optimization stack, similar in style to the previous exploration of stability and structural stability.
This section will only provide a high-level intuitive understanding of conceptors. If instead you're hungry for precise definitions, please refer to this section or this page, though keep in mind that they have originally been formulated in the specific context of RNNs, and the terminology might be slightly confusing.
Now for the hand-waving. A conceptor is a mathematical object which captures the spread of a state cloud across dimensions of the space it populates. A state cloud, in turn, is simply a set of high-dimensional vectors. Much like the outputs of PCA, conceptors can be represented as high-dimensional ellipsoids specifying a coordinate system of their own, distinct from the coordinate system of the original space. In contrast to vanilla PCA, conceptors have an additional "aperture" parameter which enables many of the neat mathematical constructions which follow. This parameter controls how strongly a conceptor adapts to the structure of the underlying state cloud.
An equivalent machine-friendly representation of a conceptor is a matrix, essentially consisting of the stacked vectors which define the new reference frame. Not unlike the correlation matrix used directly in PCA, conceptors can be obtained from underlying state clouds in closed form and in linear time with respect to the cardinality of the state cloud (every state has to be read once to build the correlation matrix). Typically, the state cloud is composed of model states (i.e. latent activations) originating from an ML model, such as an RNN or transformer. Therefore, conceptors are often, but not always, used across activation space.
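To make the closed-form claim concrete, here is a minimal NumPy sketch. It assumes the standard definition from Jaeger's report, C = R (R + α⁻² I)⁻¹, where R is the correlation matrix of the state cloud and α is the aperture. The function name and the toy state cloud are mine, purely for illustration.

```python
import numpy as np

def conceptor_from_states(states, aperture=10.0):
    """Conceptor of a state cloud with shape (n_samples, n_dims).

    Assumes C = R (R + aperture^-2 I)^-1, with R the correlation matrix.
    """
    n_samples, n_dims = states.shape
    R = states.T @ states / n_samples  # one pass over the state cloud
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(n_dims))

rng = np.random.default_rng(0)
# Toy state cloud: wide spread along axis 0, narrow spread along axis 1.
cloud = rng.normal(size=(1000, 2)) * np.array([3.0, 0.1])
C = conceptor_from_states(cloud, aperture=1.0)

# Eigenvalues of C lie strictly between 0 and 1: close to 1 along
# directions the cloud occupies, close to 0 along the others.
spread = np.linalg.eigvalsh(C)
```

Raising the aperture pushes all eigenvalues toward 1 (the conceptor lets everything through), while lowering it pushes them toward 0.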
Here's where things really stop looking like rebranded PCA, though things might translate back to PCA. The disjunction (OR) of two conceptors is defined as the conceptor obtained through the union of the two associated state clouds (see left graph below). The negation of a conceptor is defined as the conceptor which spreads in the opposite manner across each dimension. For instance, if the original conceptor represents high spread across dimension 1, but low spread across dimension 2, the negated version will capture the reverse pattern (see right graph below). The conjunction (AND) of two conceptors is defined in a slightly round-about way using De Morgan's law, a theorem which frames conjunction in terms of disjunction and negation (see middle graph below). The logical difference then yields the contrast in spread between two conceptors, and can be obtained using conjunction and negation. A rigorous 7-page definition of the Boolean ops described above can be found here. Additionally, a ~20-page grounding of the link between dynamics captured by conceptors and formal logic can be found here.
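As a rough sketch of the Boolean ops, the snippet below uses the formulas I understand to hold for invertible conceptors: NOT C = I − C, and C AND B = (C⁻¹ + B⁻¹ − I)⁻¹, with OR derived through De Morgan. The general definitions in the linked material also handle the singular case, which this sketch does not. The two correlation matrices are made-up toy values.

```python
import numpy as np

def conceptor(R, aperture=1.0):
    # Conceptor from a correlation matrix R.
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(len(R)))

def NOT(C):
    return np.eye(len(C)) - C

def AND(C, B):
    # Assumes C and B are both invertible.
    inv = np.linalg.inv
    return inv(inv(C) + inv(B) - np.eye(len(C)))

def OR(C, B):
    # De Morgan: C OR B = NOT(NOT C AND NOT B).
    return NOT(AND(NOT(C), NOT(B)))

def DIFF(C, B):
    # Logical difference: what C captures and B does not.
    return AND(C, NOT(B))

# Toy correlation matrices of two hypothetical state clouds.
R_a = np.array([[4.0, 0.5], [0.5, 0.2]])
R_b = np.array([[0.3, -0.1], [-0.1, 2.0]])
A, B = conceptor(R_a), conceptor(R_b)

# With aperture 1, OR matches the conceptor of the summed correlation
# matrices, i.e. of the pooled state clouds.
assert np.allclose(OR(A, B), conceptor(R_a + R_b))
```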
If you collect state clouds of latent activations arising in an LLM when processing different noun phrases (e.g. "apple", "orange juice") and learn one conceptor for each such symbol, then you can use their geometry to infer relations of abstraction. For instance, the "fruit" conceptor appears to spatially engulf the "apple" one, as the former contains the meanings of the latter, but some additional ones as well. You'd need to take the disjunction of "apple" with some other instances of fruit in order to approach the very concept of fruit. By working with state clouds of raw embeddings, the method has the benefit of being modality-agnostic (i.e. it should work with Vision Transformer or Decision Transformer as it does with a language model, with minimal modifications). In contrast, interpretability techniques based on witty prompt engineering attempt to elicit entities and their relationships as text output, therefore being text-only.
Reference: Nested State Clouds: Distilling Knowledge Graphs from Contextual Embeddings
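One simple way to operationalize "engulfs" is the Loewner ordering: conceptor B engulfs conceptor A if B − A is positive semidefinite. The sketch below assumes that criterion. The heuristic used in the referenced paper may well differ, and the "apple" and "orange" correlation matrices are invented toy values rather than real embeddings.

```python
import numpy as np

def conceptor(R, aperture=1.0):
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(len(R)))

def engulfs(general, specific, tol=1e-9):
    """True if `general` is at least as wide as `specific` in every direction."""
    gap = general - specific
    gap = (gap + gap.T) / 2  # symmetrize against round-off
    return bool(np.linalg.eigvalsh(gap).min() >= -tol)

# Hypothetical correlation matrices for two instances of fruit.
R_apple = np.diag([2.0, 0.1, 0.05])
R_orange = np.diag([0.1, 1.5, 0.05])

C_apple = conceptor(R_apple)
# "fruit" approximated as the disjunction of its instances, which for
# aperture 1 is the conceptor of the summed correlation matrices.
C_fruit = conceptor(R_apple + R_orange)

fruit_over_apple = engulfs(C_fruit, C_apple)
apple_over_fruit = engulfs(C_apple, C_fruit)
```

On real embeddings the ordering is unlikely to hold exactly, so a softer score (e.g. the fraction of non-negative eigenvalues of the gap) is probably more practical.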
If you incrementally train a model to internalize different dynamics (e.g. language modeling on different datasets perhaps), then you can roughly measure the "memory usage" of the model at any given point using the logical difference of the disjunction of "learned stuff" conceptors and its negation. When the model is overparametrized and freshly initialized, each new dynamic being internalized will incur a cost in representational resources. However, when new dynamics build on previous ones, the "memory usage" doesn't increase as much, as representations are being recycled. This might be used to detect whether a specific dynamic has been internalized by a model, and as a way to gauge the remaining representational budget.
Reference: Controlling Recurrent Neural Networks by Conceptors / Neural memory management
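A toy sketch of the bookkeeping, assuming the "quota" of a conceptor is its mean eigenvalue (which is how I read the neural memory management section) and that used memory is the quota of the disjunction of everything learned so far. The three diagonal correlation matrices are invented to contrast a dynamic that reuses old directions with one that needs a fresh direction.

```python
import numpy as np

def conceptor(R):
    # Aperture fixed to 1 for simplicity.
    return R @ np.linalg.inv(R + np.eye(len(R)))

def NOT(C):
    return np.eye(len(C)) - C

def OR(C, B):
    # De Morgan form; only needs I - C and I - B to be invertible.
    I = np.eye(len(C))
    inv = np.linalg.inv
    return I - inv(inv(I - C) + inv(I - B) - I)

def quota(C):
    # Fraction of representational space claimed by C (between 0 and 1).
    return np.trace(C) / len(C)

first = conceptor(np.diag([5.0, 5.0, 0.0, 0.0]))
overlapping = conceptor(np.diag([5.0, 0.0, 0.0, 0.0]))  # reuses direction 0
novel = conceptor(np.diag([0.0, 0.0, 5.0, 0.0]))        # needs direction 2

cost_overlapping = quota(OR(first, overlapping)) - quota(first)
cost_novel = quota(OR(first, novel)) - quota(first)

# Whatever is left over after loading `first` and `novel`.
free_space = quota(NOT(OR(first, novel)))
```

The overlapping dynamic costs far less additional quota than the novel one, which is the "representations are being recycled" effect in miniature.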
If you elicit a certain dynamic from a model in different situations (e.g. deception, mesa-optimization), learn conceptors from the associated model states, take the conjunction of those conceptors, negate it, and finally insert it into the recurrent/autoregressive update loop, then you might be able to suppress the specific dynamic. This basically enables blacklisting of specific problematic dynamics. Drop the negation and go for disjunction if you want to switch to whitelisting.
Reference: Human motion patterns learnt with conceptor-controlled neural network (an instance of inserting a conceptor in the update loop)
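Here is a minimal sketch of what inserting a conceptor into the update loop can look like, on a toy random RNN with update x ← F · tanh(W x + b). The "unwanted" state cloud is synthetic (confined to a random 3-dimensional subspace) and stands in for states collected while the problematic dynamic is active. All names and sizes are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
W = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))  # toy recurrent weights
b = rng.normal(scale=0.2, size=n)

# Hypothetical "unwanted" state cloud confined to a 3-dim subspace.
Q, _ = np.linalg.qr(rng.normal(size=(n, 3)))
bad_states = rng.normal(size=(500, 3)) @ Q.T

R = bad_states.T @ bad_states / len(bad_states)
C_bad = R @ np.linalg.inv(R + 10.0 ** -2 * np.eye(n))  # aperture 10
blacklist = np.eye(n) - C_bad                          # negated conceptor

def run(x0, steps, filter_matrix=None):
    x, trajectory = x0, []
    for _ in range(steps):
        x = np.tanh(W @ x + b)
        if filter_matrix is not None:
            x = filter_matrix @ x  # conceptor inserted in the update loop
        trajectory.append(x)
    return np.array(trajectory)

x0 = rng.normal(size=n)
free = run(x0, 200)
steered = run(x0, 200, blacklist)

# How much of each trajectory lives in the unwanted subspace.
leak_free = np.linalg.norm(free @ Q)
leak_steered = np.linalg.norm(steered @ Q)
```

Passing `C_bad` itself as the filter instead of its negation would do the opposite and confine the state to the subspace, which is the whitelisting variant.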
If you train a model to internalize certain dynamics and compute the associated conceptor based on said dynamics, you can then tweak the backpropagation algorithm in order that the previous representations are generally shielded from future updates, by forcing learning to make use of different representational resources. Code knowledge across other dimensions as much as possible, and don't tamper with those specific old latents.
Reference: Overcoming Catastrophic Interference Using Conceptor-Aided Backpropagation
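A heavily simplified, single-layer sketch of the shielding idea: compute a conceptor A from the inputs the layer saw on the old task, then right-multiply each gradient by F = NOT A so that updates avoid the directions the old task relies on. This is in the spirit of the referenced paper rather than a reproduction of it. The linear layer, data, learning rate, and aperture are all toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 10, 3

# Old-task inputs occupy a 4-dim subspace; new-task inputs use every direction.
basis, _ = np.linalg.qr(rng.normal(size=(d_in, 4)))
X_old = rng.normal(size=(200, 4)) @ basis.T
X_new = rng.normal(size=(200, d_in))
Y_new = X_new @ rng.normal(size=(d_in, d_out))

R_old = X_old.T @ X_old / len(X_old)
A = R_old @ np.linalg.inv(R_old + 100.0 ** -2 * np.eye(d_in))  # aperture 100
F = np.eye(d_in) - A  # directions still free for learning

W0 = 0.1 * rng.normal(size=(d_out, d_in))  # weights after the old task

def loss(W):
    return np.mean((X_new @ W.T - Y_new) ** 2)

def train(W, shield=None, lr=0.05, steps=200):
    W = W.copy()
    for _ in range(steps):
        err = X_new @ W.T - Y_new
        grad = err.T @ X_new / len(X_new)  # gradient of the squared error
        if shield is not None:
            grad = grad @ shield           # project out the shielded directions
        W -= lr * grad
    return W

def drift(W):
    # How much the layer's outputs on old-task inputs have changed.
    return np.linalg.norm(X_old @ (W - W0).T)

W_plain = train(W0)
W_shielded = train(W0, shield=F)
```

The shielded run still reduces the new-task loss, but only by using the remaining directions, so it generally cannot fit the new task as well as the unshielded run.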
If two different models exercise different dynamics, and you compute one conceptor for each, then the similarity between conceptors can be used as an effective distance metric between the dynamics implemented by the models. More generally, conceptor-aided similarity can help gauge the spread of models trained in different regimes, which might be useful for quantifying trainer stability. In contrast to KL-divergence on outputs, this approach would give more of a measure of "cognitive", rather than "behavioral", difference, by comparing latents instead of ~policies.
Reference: Controlling Recurrent Neural Networks by Conceptors / A Similarity Measure for Excited Network Dynamics
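As a sketch of such a similarity, the snippet below uses the Frobenius cosine between two conceptor matrices, trace(AB) / (‖A‖ ‖B‖). As far as I can tell this is algebraically equivalent to the SVD-based measure in Jaeger's report, but treat that equivalence as my assumption. The three state clouds are synthetic stand-ins for latents from three models.

```python
import numpy as np

def conceptor(states, aperture=1.0):
    R = states.T @ states / len(states)
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(states.shape[1]))

def similarity(A, B):
    # Cosine similarity between conceptor matrices (Frobenius inner product).
    return np.trace(A @ B) / (np.linalg.norm(A) * np.linalg.norm(B))

rng = np.random.default_rng(0)
# Models a and c spread along the same directions, model b along others.
cloud_a = rng.normal(size=(500, 4)) * np.array([3.0, 1.0, 0.1, 0.1])
cloud_b = rng.normal(size=(500, 4)) * np.array([0.1, 0.1, 1.0, 3.0])
cloud_c = rng.normal(size=(500, 4)) * np.array([3.0, 1.2, 0.1, 0.1])

C_a, C_b, C_c = conceptor(cloud_a), conceptor(cloud_b), conceptor(cloud_c)

sim_ab = similarity(C_a, C_b)  # different dynamics, low similarity
sim_ac = similarity(C_a, C_c)  # similar dynamics, high similarity
```

Note that this only compares models whose latents live in the same space. Comparing architectures with different hidden sizes would need some alignment step first.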
If you can compute a conceptor based on the dynamics employed by a model, can't you similarly compute a conceptor based on the dynamics employed by a trainer in moving models around model space? Instead of model states, you'd work with model parametrizations. That might help quantify the stability of a trainer² (i.e. a trainer trainer), by measuring the variation in the dynamics of various trainers obtained through their automated meta-optimization.
Say you used neural memory management techniques to determine that a model has already internalized a certain dynamic. If you collect multiple instances of trainer trainers yielding trainers which yield models which have internalized inappropriate dynamics (e.g. mesa-optimization), then you might want to isolate and suppress the dynamic of training towards those inappropriate dynamics. You might also want to mix and match different training dynamics using conjunctions and disjunctions.
This one is pretty weird. Say you have a trainer moving models across model space. A trainer trainer moves it across trainer space. In other words, the trainer² changes the way the trainer moves models across model space. You might want to preserve some specific training dynamic during the optimization exerted by the trainer². For instance, human amplification loops might be vulnerable to human modification orchestrated by downstream levels of the optimization stack. Shielding some aspect of the initial trainer dynamics (i.e. inducing some path-dependence on trainer²) might protect it from being arbitrarily overwritten, especially when the trainer is itself trained with backprop. However, the corollary is that human amplification might be hindered.
Similar to how you might check whether a model has already internalized certain dynamics, you might also be able to check whether a trainer being trained has happened to internalize certain dynamics. This hints at some nested boxed setups where automated mechanisms are employed to detect dynamics at various levels in order to e.g. shield and isolate, similar to the speculative applications above.
Similar to how one might infer relations of abstraction between model states coding for different concepts (e.g. fruit > apple), it might be interesting to investigate whether certain classes of trainers only implement subsets of the dynamics (across model space) enabled by a broader trainer class.
Conceptors are inherently linear. While non-linear variants have been hinted at in conceptor literature, it's difficult to see how the neat mathematical constructions (e.g. Boolean ops) might transfer. Conceptors might not be expressive enough to capture the intricate high-dimensional non-linearities present in contemporary prosaic models. For instance, the interpretability application above already hits related obstacles:
Besides, even if quite computationally efficient in general, conceptors still run into computational constraints as the state cloud's dimensionality increases. Computing individual conceptors across a model space of 175B dimensions is currently intractable. However, forthcoming work suggests the feasibility of inserting local conceptors at different locations in a larger model. I've heard that John Wentworth has mentioned local PCA, which might also be related, but I can't find a good reference.
"Different agents may have different logics. I will therefore not try to define a general [intrinsic conceptor logic] that would fit any neural agent. Instead every concrete agent with a concrete lifetime learning history will need his/her/its own individual conceptor logic. [...] Making an ICL private to an agent implies that the model relation ⊨ becomes agent-specific. An ICL cannot be specified as an abstract object in isolation. Before it can be defined, one first needs to have a formal model of a particular agent with a particular lifelong adaptation history. [...] In sum, an ICL (formalized as institution) itself becomes a dynamical system, defined relative to an existing (conceptor-based) neural agent with a particular adaptation history. The “state space” of an ICL will be the set of signatures. A “trajectory” of the temporal evolution of an ICL will essentially be a sequence of signatures, enriched with information pertaining to forming sentences and models." — Herbert Jaeger
"Attractors, by definition, keep the system trajectory confined within them. Since clearly cognitive processes do not become ultimately trapped in attractors, it has been a long-standing modeling challenge to account for “attractors that can be left again” – that is, to partly disengage from a strict DISA paradigm. Many answers have been proposed. Neural noise is a plausible agent to “kick” a trajectory out of an attractor, but a problem with noise is its unspecificity which is not easily reconciled with systematic information processing. A number of alternative “attractor-like” phenomena have been considered that may arise in high-dimensional nonlinear dynamics and offer escapes from the trapping problem: saddle point dynamics or homoclinic cycles; chaotic itinerancy; attractor relics, attractor ruins, or attractor ghosts; transient attractors; unstable attractors; high-dimensional attractors (initially named partial attractors); attractor landscapes." — Herbert Jaeger
Conceptors provide a useful formalism for operating with high-dimensional dynamics, similar to the ones dominating contemporary prosaic architectures. The way they combine the discrete nature of formal logic with the continuous nature of dynamical systems makes them promising candidates for steering high-dimensional dynamics using crisp controls. Their flexibility leads to many low-hanging fruit applications, and probably even more waiting at higher levels of complexity.