This is a post on a novel cognitive architecture I have been thinking about for a while now. It started as a conceptual playground for concretising some of my agent foundations ideas, and has lately become an idea for a project that approaches the Alignment Problem directly: an AI-Seed approach to an inherently interpretable yet generally intelligent architecture, on which to implement a notion of "minimal alignment" before figuring out how to scale it.
I don't strongly expect this to work, but I don't yet see where it has to fail and it feels sufficiently fresh/untried to at least see where it breaks down. 
I omitted most of the technical details that I am considering, since this post is about the conceptual portion of the architecture, and purposefully not fully polished (to get around some of my publishing anxieties). 
I am hoping to treat this as a living document or to make follow-up posts with updates and larger clarifications. Because of that, I really appreciate criticisms and challenges to any ideas you find in here.
Anyway, here goes:


The Prop-room and Stage Cognitive Architecture is a neuro-symbolic architecture for embedded agency, PSCA for short. It is inspired by the observation that the human brain seems to run an ongoing simulation of the “current scene”, in which many sensory and emotional modalities are unified into a single referential context. Furthermore, the level of resolution at which elements of this simulation are being simulated depends on their relevance to the current objective that the human is attending to.

The proposed architecture aims to emulate this particular ability of the brain by specifying a “Stage”, which is a hypergraph representing the “current scene”, that contains present elements of relevance-adjusted resolution. The features of those elements are parameter nodes that are connected by edges corresponding to the causal relationships within the current scene. 
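To make the data structure a little more tangible, here is a minimal sketch of a Stage as a hypergraph. Everything in it is an assumption made for illustration: the class names `ParamNode`, `HyperEdge` and `Stage`, the choice to store certainty as a float in [0, 1], and the toy scene itself are not part of any existing implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ParamNode:
    """A feature of a scene element, with a value and a certainty in [0, 1]."""
    name: str
    value: object = None
    certainty: float = 0.0


@dataclass
class HyperEdge:
    """A (causal) relation that may join any number of parameter nodes."""
    label: str
    members: tuple


@dataclass
class Stage:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, name, value=None, certainty=0.0):
        self.nodes[name] = ParamNode(name, value, certainty)

    def relate(self, label, *members):
        missing = [m for m in members if m not in self.nodes]
        if missing:
            raise KeyError(f"unknown nodes: {missing}")
        self.edges.append(HyperEdge(label, tuple(members)))

    def neighbours(self, name):
        """Names of all nodes that share at least one hyperedge with `name`."""
        out = set()
        for edge in self.edges:
            if name in edge.members:
                out.update(edge.members)
        out.discard(name)
        return out


# Hypothetical toy scene: a visible piece of fruit, the agent, and its objective.
stage = Stage()
stage.add_node("fruit.position", (3, 4), certainty=0.9)
stage.add_node("agent.position", (0, 0), certainty=1.0)
stage.add_node("objective", "eat", certainty=1.0)
stage.add_node("agent.next_action", None, certainty=0.0)  # still to be determined
stage.relate("reach", "agent.position", "fruit.position",
             "objective", "agent.next_action")
```

A single hyperedge here ties four nodes together, which is the main thing an ordinary graph with pairwise edges could not express directly.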

The idea is now to use constraint satisfaction in order to increase certainty about previously uncertain nodes in the graph, which includes nodes referring to the agent’s next action. The current objective is also part of the scene, playing an important role in constraining the expected action. One interesting feature of this architecture is that it naturally concretises objectives, so that the general goal of e.g. finding something to eat becomes the concrete goal of navigating to a visible piece of fruit in the context of the current scene.

In order for the Stage to function as intended, there needs to be a repository of possible elements and resolutional variations thereof from which the Stage can be populated and updated. This function is served by the “Prop-room”, which is another hypergraph of a much larger and more static nature that contains all of the “world model pieces” that the agent has acquired so far, perhaps ideally stored in a hierarchical structure. If the agent observes an object (or “behavior” for that matter), there will be a pattern-matching process in the Prop-room to determine whether the object is new or familiar. In the former case, a new entry will be formed, and in either case, the matching entry will be copied at a relevance-adjusted resolution to the Stage.
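The match-or-create step could look roughly like the sketch below. It swaps in a deliberately crude stand-in for real subgraph matching: entries are flat feature lists compared by Jaccard similarity, and "resolution" is simply how many of the coarse-to-fine ordered features get copied. The names `Entry`, `PropRoom`, `match_or_create`, `load` and the threshold of 0.5 are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    name: str
    # Features ordered from coarse to fine, so a prefix is a lower-resolution view.
    features: list = field(default_factory=list)


def jaccard(a, b):
    """Overlap of two feature collections, between 0 and 1."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


class PropRoom:
    def __init__(self, threshold=0.5):
        self.entries = []
        self.threshold = threshold  # assumed cut-off for "familiar"

    def match_or_create(self, observed):
        best = max(self.entries,
                   key=lambda e: jaccard(e.features, observed),
                   default=None)
        if best is not None and jaccard(best.features, observed) >= self.threshold:
            return best, False
        new = Entry(f"concept_{len(self.entries)}", list(observed))
        self.entries.append(new)
        return new, True

    def load(self, observed, resolution):
        """Copy the matching entry to the Stage, truncated to `resolution` features."""
        entry, is_new = self.match_or_create(observed)
        return {"source": entry.name, "features": entry.features[:resolution]}, is_new


room = PropRoom()
room.entries.append(Entry("tree", ["tall", "green", "trunk", "branches", "leaves"]))

# A familiar observation is matched and loaded at low resolution.
prop, is_new = room.load(["tall", "green", "trunk", "leaves"], resolution=2)

# An unfamiliar observation creates a new entry first.
prop2, is_new2 = room.load(["small", "furry", "moving"], resolution=3)
```

In the actual architecture the relevance of the element to the current objective would decide the resolution; here it is passed in by hand.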

There is a learning algorithm at play here, through which the agent updates and creates the models in the Prop-room, including their contextual relevance, based on other factors of a scene. The perceptual learning will probably involve one or more contemporary neural networks, making this a neuro-symbolic architecture. During the initial stages of development, perception can be performed by a handcrafted system, so the desiderata regarding a later-stage perceptual subsystem will be clarified at a later point.

The reason why hypergraphs are sort of the symbolic primitive of this architecture is that they are fully computationally general (can represent any algorithm) while being a relatively interpretable data structure for causal modeling. There are a number of operations that are quite natural to run on them, like constraint satisfaction and pattern matching (computing the differences between two subgraphs), and it is relatively intuitive to imagine larger structures, like the integration of different sensory modalities as two subgraphs with sparse connections/edges that determine the relationship between different stimuli (e.g. the sound of a voice indicating the presence of a person at some vaguely specified location).


The Stage

The Stage is in large part intended as a “relevant world model generator”. Like all embedded agents, the PSCA agent has to perform computations about a more complex system than itself, the environment (including itself), in “real time” in order to decide on sensible actions. There is a trade-off between modeling in more detail, modeling more quickly and modeling in parallel to entertain different possibilities under uncertainty. The Stage is supposed to solve this trade-off and present the most relevant accessible world model for the next decision.

We can imagine the process on the Stage in 3 steps:

  1. The current state of the Stage is generated based on multiple factors
    1. Predictions made by the previous state of the Stage
    2. Partial environment state as perceived by the sensors and processed by the Prop-room
    3. Current concrete objective
    4. Reference frame
  2. Current state of the Stage is improved through constraint satisfaction
    1. The values of nodes and the certainty in these values can be used to constrain the expectation of the values of other nodes that they share relational edges with
    2. The idea is to distinguish between low probability and low plausibility, in terms of more certain nodes statistically contradicting each other
    3. One can introduce nodes that are always very uncertain or always very certain to shape the rest of the graph
  3. Current state of the Stage is used to predict the next state of the stage, which crucially involves the prediction of the agent’s own action (which is used to determine the action but needn’t be the only influence on it)
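Steps 2 and 3 can be sketched with a very small constraint-propagation loop. This is a simplification, not the intended mechanism: instead of graded certainties it uses crisp candidate sets per node (a smaller set standing in for higher certainty) and prunes unsupported values by brute force. The function names, the one-dimensional world and the `approaches` constraint are assumptions for illustration.

```python
from itertools import product

MOVES = {"left": -1, "stay": 0, "right": 1}


def propagate(domains, constraints):
    """Prune values that no constraint supports, until nothing changes.

    domains: {node: set of candidate values}
    constraints: list of (nodes, predicate) hyperedges.
    """
    changed = True
    while changed:
        changed = False
        for nodes, pred in constraints:
            for i, node in enumerate(nodes):
                supported = {
                    combo[i]
                    for combo in product(*(domains[n] for n in nodes))
                    if pred(*combo)
                }
                if supported != domains[node]:
                    domains[node] = supported
                    changed = True
    return domains


def approaches(agent_x, fruit_x, objective, action):
    """The objective constrains the action: it must reduce the distance."""
    if objective != "reach_fruit":
        return True
    return abs(agent_x + MOVES[action] - fruit_x) < abs(agent_x - fruit_x)


def predict_next(domains):
    """Step 3: roll the Stage forward once the action is fully determined."""
    (action,) = domains["agent.next_action"]
    (x,) = domains["agent.x"]
    return {**domains, "agent.x": {x + MOVES[action]},
            "agent.next_action": set(MOVES)}


# Step 1 (hand-written here): the current Stage state.
domains = {
    "agent.x": {0},
    "fruit.x": {3},
    "objective": {"reach_fruit"},
    "agent.next_action": set(MOVES),  # initially fully uncertain
}
constraints = [
    (("agent.x", "fruit.x", "objective", "agent.next_action"), approaches),
]

domains = propagate(domains, constraints)  # step 2
nxt = predict_next(domains)                # step 3
```

After propagation the only remaining candidate for `agent.next_action` is `"right"`, and the predicted next state places the agent at position 1 with the action node uncertain again.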

I’d like to note that this is intended as a very flexible design, as it allows both for the determination of minute “micro-actions” in very short time, as well as more abstract actions that involve a larger reference frame (more on planning in the Training section). One way of imagining normal operation in a complex environment is to think of larger stage states setting the reference frame for smaller ones, and having multiple nested layers of total Stage-resolutions with the larger ones interspersing the greater number of smaller ones, with some conditions regarding when to initiate which size/scope for the next generation. There are different possible implementations, e.g. ones that favor parallelism vs those that favor a single unified stage progression. 

Similarly to how it works in the brain, it is plausible that the heavy lifting for determining the “next state” will be done by the prediction from the previous state, with the perceptual cognition more so as a corrective influence than a major contributor of Stage-elements. This kind of predictive simulation somewhat depends on shared reference frames between adjacent Stage states, though it is relatively computationally simple to reduce the scope/resolution of a given reference frame.

Learning can be introduced at multiple points of the Stage. Whether it is about determining the right balance between different Stage resolutions, determining the conditions for switching reference frames, the conditions for favoring the predicted sub-state over the perceived sub-state, or which concrete objective to pick, all of these can be finetuned over time based on total performance or coherence or some other metric. Researching an architecture like this would involve a lot of freezing of some learning processes while studying the interaction of single or few ones, and strikes me as a significant challenge particularly because many of these learning processes would influence the context of the others. It would be open to the researchers to supply a fixed algorithm rather than a learning one in some of these places.


The Prop-Room

The Prop-room is the place where the agent’s models about reality are stored. It is not a memory of events or places, or even particular objects, but rather a potentially giant hypergraph that contains information about how encountered concepts are related to each other. Different objects that the agent encounters will be represented as subgraphs, with the depth of the subgraph roughly corresponding to the depth of understanding about the respective more granular elements/concepts making up the object by means of their relationship to each other. This is not limited to physical objects, and can also involve concepts like forces, incentives, numbers, behaviors, etc.

The function of the Prop-room is to provide the ingredients for the predictive simulation happening on the stage, and its development is quite teleological in that it expands and changes based on the usefulness of the concepts that it contains. 

Initially, we can imagine the Prop-room in a very superficial state, perhaps manually set up by the researchers, a couple of nodes and edges corresponding to some elements that the agent might encounter in its environment. This correspondence is achieved through perceptual recognition, namely that some perceptual stimuli will be matched with entries of the Prop-room, prompting their inclusion within the next Stage-state. 

A major learning process for this architecture is to match compositions of sensory stimuli with appropriate/useful concepts in the Prop-room, and to also recognize when there is no sufficiently close match, which would initiate the creation of a new entry in the Prop-room. 
As the agent learns more about its environment, the initial graph expands and branches out, always connecting new concepts to already existing ones.

What is important to understand about this is that the meaning of a new concept/entry is entirely contextual, based on how it is connected with existing content. Things are defined in terms of each other, the contents entirely meaningless if not for the structure relating them. The straightforward inclusion of novel concepts is a challenge for this architecture, since anything new can only be understood in terms of what is already known. Some further thinking/research is necessary to determine the expected degree of coherent (and hopefully interpretable) knowledge/understanding formation.

The learning for the Prop-room is intended to be set up in such a way that, over time, sufficiently similar concepts/elements will be integrated/merged to avoid redundancy, which can still keep options open through e.g. the partial “activation” of a subgraph that is unifying the two concepts. The Prop-room should not hold separate subgraphs on the different types of trees that it has encountered, but rather converge towards a “prototype” of what a tree is, that is slightly altered/specified based on the specification of the exact type. A further specification, that happens only in the transmission from the Prop-room onto the Stage, would be something like a particular branch structure and other details belonging to a specific tree. 
The idea is that a single concept of “tree” in the Prop-room can be utilized and specified to spawn entries of particular trees on the Stage, including the causal and statistical connections between their features.
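As a rough illustration of the three levels involved (prototype, type, and particular instance on the Stage), the sketch below factors two similar entries into a shared prototype plus small per-type differences, and then specifies an individual tree from them. Plain dictionaries stand in for subgraphs, and the feature names as well as `merge_into_prototype` and `specify` are hypothetical.

```python
def merge_into_prototype(a, b):
    """Split two similar entries into shared structure and per-entry deltas."""
    shared = {k: v for k, v in a.items() if k in b and b[k] == v}
    delta_a = {k: v for k, v in a.items() if k not in shared}
    delta_b = {k: v for k, v in b.items() if k not in shared}
    return shared, delta_a, delta_b


def specify(prototype, *overrides):
    """Layer increasingly specific details over a prototype (later wins)."""
    out = dict(prototype)
    for o in overrides:
        out.update(o)
    return out


oak = {"has_trunk": True, "has_leaves": True, "leaf_shape": "lobed"}
birch = {"has_trunk": True, "has_leaves": True,
         "leaf_shape": "serrated", "bark": "white"}

# Prop-room: one "tree" prototype plus small type-specific deltas.
tree, oak_delta, birch_delta = merge_into_prototype(oak, birch)

# Stage: a particular oak, with instance details added only on transfer.
this_oak = specify(tree, oak_delta, {"branch_count": 7, "height_m": 18})
```

Nothing is lost by the merge in this toy case: `specify(tree, oak_delta)` reproduces the original `oak` entry exactly.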

I currently believe that the Prop-room could lend itself very well to the formation of natural abstractions, with different categories of abstractions corresponding to salient levels of resolution that retain the relationships between concepts such as to enable high-level decision making. 
As an example, the ability to categorize something as an obstacle or a resource depending on the context can be very useful. Provided an abstraction hierarchy in the Prop-room, the learning process can straightforwardly approximate the correct level of “zoom”/resolution when loading something into the Stage, essentially deciding on which details to consider. 
Of course, sometimes one might care about an object only very superficially, except for a very particular detail about it, which is difficult to represent in this hierarchical frame unless said detail is frequently important and thus weighed differently (=expressed even in the low resolution representation). 
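One possible way to picture the "zoom" selection, including the exception for a frequently important detail, is sketched below. The `Feature` class, the `importance` weight and the `pin_threshold` parameter are assumptions for illustration; in the architecture these weights would be learned rather than set by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    depth: int                # 0 = coarsest level of the abstraction hierarchy
    importance: float = 0.0   # high values survive any zoom level


def load_at_zoom(features, zoom, pin_threshold=0.8):
    """Keep features up to the requested depth, plus any 'pinned' details."""
    return [f.name for f in features
            if f.depth <= zoom or f.importance >= pin_threshold]


mushroom = [
    Feature("is_object", 0),
    Feature("is_fungus", 1),
    Feature("cap_colour", 2),
    Feature("is_poisonous", 3, importance=0.95),  # deep, but often decisive
]

coarse = load_at_zoom(mushroom, zoom=0)
detailed = load_at_zoom(mushroom, zoom=2)
```

Here `coarse` contains only `is_object` and the pinned `is_poisonous`, while `detailed` contains all four features.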

There are many raw ideas on how to select for beneficial structure in the Prop-room, e.g. how to dedicate a section to values/objectives and make sure they stay there while having a principled relationship to the rest of the network, how to efficiently find redundancy over larger distances within the graph, or how to enable interpretable acquisition of language, but these are better left for a future post and future discussion. Suffice it to say for now that the symbolic nature of the architecture allows us a constructive interface for altering/testing various goal-content relationships, among other things.

Concretely, we might have a node that represents current (expected) performance and that is connected to three other nodes that roughly correspond to the parallel objectives that we want our agent to track. The edges of these three to the initial node might correspond to the proportional importance of the parallel objectives.
We could start out teaching (+rewarding) the agent to feed itself, but later cut that edge. Now, various concepts around food/resource acquisition are still present in the overall network, and they can be called upon for instrumental reasons. I don’t know if this is cleaner than just letting the agent instrumentally acquire self-sustaining habits/behaviors while pursuing different objectives, but I suspect that it would lead to less interpretable and possibly less general conceptual structures. 
Imbuing the agent with a very clean and general concept of self-sustainment could be very beneficial for grounding/contextualizing later more abstract concepts.
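The performance node and its weighted edges might be sketched as follows. The objective names, the weights, and the decision to renormalise the remaining weights after cutting an edge are assumptions made for this example, not something the design prescribes.

```python
# Edge weights from each parallel objective to the "expected performance" node.
objective_weights = {"feed_self": 0.5, "explore": 0.3, "build_shelter": 0.2}


def expected_performance(weights, scores):
    """Weighted sum of per-objective scores (each assumed to lie in [0, 1])."""
    return sum(w * scores.get(name, 0.0) for name, w in weights.items())


def cut_edge(weights, name):
    """Remove one objective's edge and renormalise the rest (an assumption)."""
    remaining = {k: v for k, v in weights.items() if k != name}
    total = sum(remaining.values())
    if total == 0:
        return remaining
    return {k: v / total for k, v in remaining.items()}


scores = {"feed_self": 1.0, "explore": 0.5, "build_shelter": 0.0}

before = expected_performance(objective_weights, scores)

# Later in training: feeding is no longer rewarded directly.
after_weights = cut_edge(objective_weights, "feed_self")
after = expected_performance(after_weights, scores)
```

Only the edge to the performance node is removed. Whatever food-related concepts exist elsewhere in the Prop-room are untouched by `cut_edge` and remain available for instrumental use.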



Since the Prop-room is not intended to contain a map of the environment or store very specific objects/scenes, there will be a need for a separate module that contains this content. Maps, specific instances of objects that are unusually distant from the common prototype, or entire scenes that would be computationally inefficient to reconstruct on the Stage every time they are encountered (e.g. because they are relatively static but complex) - the memory is a bunch of data structures tracking relevant information that can’t be efficiently recovered by means of simulation on the Stage.

Still, one can expect many of these stored memories to only be partially expressed, since some of their content can be reproduced/reinferred by the Stage, and thus doesn’t need to take up extra storage capacity. They work somewhat like seeds or keys for the simulator to unfold, and can be more compressed the more advanced the simulator is. We might ultimately expect the PSCA to find an efficient trade-off between how costly some data is to store vs how costly it is to generate (both at the level of resolution/granularity that is relevant).
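The "seed" idea can be illustrated by storing only what the Stage would get wrong when reconstructing a scene. In this sketch a scene and the Stage's reconstruction are both flat dictionaries, which is a strong simplification, and `compress`, `restore` and `should_store` are hypothetical helpers with a naive cost comparison.

```python
_MISSING = object()


def compress(scene, predicted):
    """Keep only what differs from the Stage's own reconstruction."""
    overrides = {k: v for k, v in scene.items()
                 if predicted.get(k, _MISSING) != v}
    absent = [k for k in predicted if k not in scene]  # predicted but not there
    return overrides, absent


def restore(seed, predicted):
    """Unfold a stored seed using the Stage's reconstruction as the base."""
    overrides, absent = seed
    out = {k: v for k, v in predicted.items() if k not in absent}
    out.update(overrides)
    return out


def should_store(storage_cost, regeneration_cost, expected_uses):
    """Naive trade-off: store if it is cheaper than regenerating on each use."""
    return storage_cost < regeneration_cost * expected_uses


# What the Stage would reconstruct for "a hut", versus this particular hut.
predicted = {"walls": "stone", "roof": "thatch",
             "door": "closed", "chimney": "smoking"}
scene = {"walls": "stone", "roof": "thatch", "door": "open", "sign": "inn"}

seed = compress(scene, predicted)
recovered = restore(seed, predicted)
```

A better simulator produces a `predicted` that is closer to `scene`, which in turn shrinks the seed that has to be kept in memory.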



Training

This is fundamentally an “AI-seed” setup.
The PSCA is intended to be trained in a simulated environment of growing complexity/difficulty, which could be accomplished by various “game levels” or a more smoothly continuous progression in “cognitive stress” that the PSCA is subjected to. One of my objectives with this research is to develop and update a predictive theory that we can constantly test, and that allows us to alter the (complexity) growth of the environment in order to achieve certain learning results. Those could be the acquisition of some general concepts that we are selecting for the PSCA to learn, or to build up habits or a resistance to certain classes of behavior.

To be clear, this means that designing the training environment might be the most challenging part of this research, as the various features of the training environment, combined with the objective that the agent pursues, shape the deeper concepts and abstractions that the agent ends up learning. I am thinking about formalizing this so that we recognise clearly that this is about confronting the agent with various information patterns that will elicit certain cognitive/conceptual structures over time - are we teaching our agent concepts early on that are sufficiently general and refineable, or are we introducing habits and patterns that are overly specific to the current environment? Arguably, the former would be quite beneficial for a sort of developmental interpretability. 
It’s certainly non-trivial to figure out the order by which scalable/general concepts can and should be taught, relative to different developmental stages of the agent.

I am very much imagining later variants of this system to control some body in a virtual environment like an island or valley, navigating through various challenges across different time horizons, like building shelter and finding optimal routes for acquiring regrowing resources. How tractable the cognition of this agent is to interpret ultimately depends on how “clean” its learning experiences/trajectories were.


I am particularly interested in exploring formalisations of environmental teaching of learned planning, if this architecture shows promise in basic areas. In a way, we can think of a plan as a very abstract, initially underspecified action. 
It is a nested representation of constraints, e.g. what I want to achieve in a year constrains what I should do in the respective months, which constrains my week plans, etc. until I have a decent idea of what I want to accomplish on this very day. 

What people sometimes miss is that we have constraints arising from the bottom as well, i.e. what I can accomplish in a day, mutually calibrating against those high level constraints in search of a resonant equilibrium (check out adaptive resonance theory if you like) that clarifies the intermediate layers of the problem (in this case perhaps on the months/weeks scale), before concretising the realistic and actionable year plan. Or instead of a year plan, we can use this to generate a high level map of the explored environment - the process can be generalized to any complex search process with constraints at different resolutions. 
The general idea is to generate salient intermediate layers of integration that are found through such a process, use those to select for actionable high-level objectives, to finally select for suitable action at the low level. A carefully conceived curriculum for the agent to learn when to engage on what level of model refinement seems plausible to conceptualize.
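A toy version of constraints meeting from both directions is sketched below. It is far simpler than a real resonance-style equilibrium search: the quantity being planned is a single number, the hierarchy is fixed at year/week/day, and one bottom-up pass followed by one top-down pass is enough to reconcile the levels. The function name `reconcile` and the fan-out values are assumptions.

```python
LEVELS = ["year", "week", "day"]
FANOUT = [52, 7]  # weeks per year, days per week (a 364-day toy year)


def reconcile(year_goal, day_capacity):
    """Fit a high-level goal to low-level capacity and split it per level."""
    # Bottom-up pass: what is achievable at each level, given daily capacity.
    feasible = [day_capacity]
    for n in reversed(FANOUT):
        feasible.append(feasible[-1] * n)
    feasible.reverse()  # now ordered [year, week, day]

    # Top-down pass: clip the goal to what is feasible, then subdivide it.
    plan = [min(year_goal, feasible[0])]
    for n in FANOUT:
        plan.append(plan[-1] / n)
    return dict(zip(LEVELS, plan))


# An over-ambitious goal is pulled down by the bottom-up constraint.
ambitious = reconcile(year_goal=1000, day_capacity=2)

# A modest goal passes through and only sets the lower-level targets.
modest = reconcile(year_goal=364, day_capacity=2)
```

In the first case the year plan is capped at 728 units (2 per day), while in the second the goal of 364 is kept and translates to 1 unit per day. The intermediate week level falls out of the reconciliation in both cases.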


On neural representations

It does not seem that richer/more abstract concepts necessarily require more complex neural encoding schemes. We could try to use the neural perception layer to translate the various sensor inputs into a more limited but high-resolution language of encountered patterns, like objects and movements or foreground and background, loaded with certain features like color, size or direction, to fill in potential parameters in our symbolic representations. Crafting this language (or directing its emergent development) may not be trivial, but the real world (as well as most suitable training environments) contains many regularities that are "referentially contained" and should allow for an "approximately correct" simulation in a simpler descriptive language than the weird quantum graphs that are actually causing things.
More than that, once an AGI seed has matured to a certain point, it can take a role in updating/upgrading its perceptual system to address common or critical challenges that arise from its imperfection. I would further pursue this avenue to get around the inscrutability that might arise on the subsymbolic layer. We might even not use a neural network at all.


Beyond the Stage

The PSCA is intended to have additional core modules during later development, like a second stage that is not strongly entangled with perception and allows for arbitrary scene construction or counterfactual reasoning. I have many ideas about this, but it makes sense to get the prototype going to test and validate more primary ideas.

To give some sense of direction, I believe that the Prop-room and Stage setup is fundamentally capable of expressing general intelligence/competence, and that any additional modules mainly help with interpretability of the system, and potentially make it more efficient (which might matter a lot in the end).

For instance, I think that it is quite natural for a growing mind like this to develop a sort of proto language in terms of symbols/objects that are realized in the environment, like how unexpected smoke signals something bad, or a landmark allows one to keep track of a resource. So, in some sense, these symbols have a meaning, have an implication, and it is only an association away to utilize this symbolic function as a sort of outsourced cognition. The agent can create landmarks itself, or even set up dynamic processes that will produce a signal upon their completion. Once we are in the territory of “I can connect this distinct signal to an arbitrary meaning that would be useful to be informed about” (aka mutual information), I think I know how we get to higher order language, and can comfortably introduce NPCs that use simplistic language for the agent to pick up on, and conceptually associate with those formed structures. 

This should suffice to understand how language can emerge as a natural part of the processing on the Stage - but the system might still benefit from a more dedicated linguistic module, perhaps as a highly specialized variant of the general Stage.



So, what does all of this afford us? I mentioned a couple of things in the training section, but here are a few more general considerations.
For one, I would expect the study of “agents generating predictive, relevance-adjusted simulations” to yield generalizable insights about embedded agents, as many of the dynamics spelt out for this architecture are hypothesized to emerge in some way within other embedded agents as well. This is a way for both understanding and influencing these features and dynamics, e.g. capping the maximum scope of reference frames that the agent will consider, rather than leaving it up to the learning algorithm to approximate the maximally useful scope for a given environment.

Additionally, interpreting a single state of the Stage is relatively tractable, provided some adequate understanding of what the different nodes and edges refer to. The objectives are concretised and explicit as an element that is considered in isolation in the process of generating the next Stage state, allowing clearer interpretation of what the agent is “trying to do” during any particular Stage state. 

My current thinking around “alignment” suggests that strict control of a superintelligence is not feasible, and that it is more promising to solve for a mature notion of alignment in the minimal setting that can hold this notion - and then figure out how to expand this into less minimal settings, by figuring out the principled operations by which a setting can become more complicated, from a causal modeling perspective. I am currently thinking about four such operations primarily:

  1. Some rules or relationships of elements of the environment change
  2. A new level of resolution is introduced to a given set of subsystems
  3. An entirely new element is introduced
  4. Greater quantity of known elements/relationships, “same but more”

Please let me know if you can think of any other categorical changes in the environment that may make the agent update its models, and which can’t be expressed with the above 4.

IF we actually solve for what we actually mean by alignment, but in a minimal setting, THEN there might be principled ways of extending that alignment across any sufficiently incremental dimension of “environment complexification”, to ultimately approximate from the solved virtual environment to the real world.

If we could, e.g. balance an objective generator around “helpfulness” against an objective generator around “non-interference”, and also describe or figure out how to update such objective generators along each of those four axes, in principle, I would feel much better about training an aligned, roughly human level or mildly superhuman AGI and studying/consulting it for how to proceed.

There is more to write on this. The ideas feel somewhat low resolution, but potentially worthwhile to concretise.


Initial research project for the prototype

The main thing I care about for prototyping the PSCA ideas and starting with implementation, is to get a small Stage and Prop-room combination working in simple environments like 2d games or a simplified Minecraft setting. I am already working on this, but it is not my most efficient activity.

Main prototype goals will be about:

  • Building it and looking at the bottlenecks and whether there is anything surprising
  • Evaluating the hypergraph representation in practice
  • Figuring out the learning mechanisms for
    • Updating the models in the Prop-room according to new evidence
    • Making an ideal, resolutionally adjusted selection for how to load a given scene into the stage

I don’t care that much about efficiency at this point, nor about how to e.g. best turn sensory data into hypergraph representations, and would instead let a separate system take care of that for now or directly feed the data to the system in a suitable format.
I care a bit about performance, in that I would be surprised if there were any fundamental bottlenecks to optimizing performance at various simple games.


Interesting! I'm not following everything, but it sounds like you're describing human cognition for the most part.

I found it interesting that you used the phrase "constraint satisfaction". I think this concept is crucial for understanding human intelligence; but it's not used very widely. So I'm curious where you picked it up.

I agree with your conclusion on the alignment section: these seem like low-resolution ideas that seem worth fleshing out.

Good job putting this out there without obsessively polishing it. That shares at least some of your ideas with the rest of us, so we can build on them in parallel with you polishing your understanding and your presentation.

Thanks a lot for the encouragement :)

Yes, I am trying to understand a generalized (which also means simplified) and formalizable parallel to human cognition. Some of my thinking on this is inspired by predictive coding and adaptive resonance theory (although pretty loosely by the latter), and I am trying to figure out the implications of our most updated understanding of neurobiological principles, together with a notion of the "riverbeds of cognition".

In other words, how can we design an architecture such that it is not pressured to take shortcuts or "work around" design decisions we made, as its cognition develops? Is there a "natural path" of cognitive development that avoids some of the common pitfalls and failure modes (i.e. can we aim for inner alignment if we have proficiency in this area)?
This has a direct bearing on interpretability, and goes together with the goal of a sort of "conceptual curriculum" that is intended to teach the system natural abstractions.

If I remember correctly, the centrality of "constraint satisfaction" fell out of considering causal (hyper/meta)graphs as sensible representational substrate (which was partially inspired by Ben Goertzel). I personally find it quite intuitive to think in graphs.


In my case introspection led me to the realisation that human reasoning consists to a large degree of two interlocking parts: finding constraints on the solution space and constraint satisfaction.

Which has the interesting corollary that AI systems that reach human or superhuman performance by adding search to NNs are not really implementing reasoning but rather brute-forcing it. 

It also makes me sceptical that LLMs+search will be AGI. 

I agree with all of that. Even being sceptical that LLMs plus search will reach AGI. The lack of constraint satisfaction as the human brain does it could be a real stumbling block.

But LLMs have copied a good bit of our reasoning and therefore our semantic search. So they can do something like constraint satisfaction.

Put the constraints into a query, and the answer will satisfy those constraints. The process used is different than a human brain, but for every problem I can think of, the results are the same.

Now, that's partly because every problem I can think of is one I've already seen solved. But my ability to do truly novel problem solving is rarely used and pretty limited. So I'm not sure the LLM can't do just as good a job if it had a scaffolded script to explore its knowledge base from a few different angles.