Hi everyone, we are working on a project for the AI Safety Camp (AISC) 2023, and this is the first progress report. 

To summarize, the aim of the project is to begin exploring an alternative pathway to interpreting generally intelligent computational agents. We want to study neurosymbolic systems as an inherently more interpretable paradigm compared to the current deep learning architectures, and hope to resolve some of the limitations of neurosymbolic systems, develop new conceptual insights/frameworks in this context, as well as transport relevant insights from them back into the deep learning context. The project is very open-ended.

Neurosymbolic architectures can provide a rich research interface for studying various hypotheses about cognition, as well as alignment. The reason for this is generally that the effect of precise changes to the architecture’s modules or their content can be tracked much more clearly compared to within entirely subsymbolic systems. Learning occurs more locally, as opposed to a sweeping change through the entire architecture, and the pathways of decision making and modelling generally include multiple modules that can be studied in isolation. There is often greater clarity about which cognitive behavior is upstream of some given cognitive behavior, allowing a better grasp on the range of behavior for the given area. 

Cognitive Architectures are a diverse field, and it is difficult to make general statements about them. Our project focuses on a novel cognitive architecture, the Prop-room and Stage Cognitive Architecture (PSCA), which was proposed by the research lead Robert Kralisch. It is meant to provide a more open-ended and thoroughly understood framework in which up-to-date insights from neuroscience and cognitive science more generally can be integrated or discarded as the development progresses. An additional reason for proposing a novel architecture is to better construct and isolate particular dynamics that we believe may be important to understand for solving the Alignment Problem.

To be clear, this does not mean that we think that this approach will work for certain. However, it seems like an important thing to try.

In this Progress Report, we give a brief overview of what everyone has been working on over the course of the last month, and give some reflection on the pace and shape of the project progress. There will be a section for each member of the research team.


Robert Kralisch (RL)

Focused on: project management, sharing relevant conceptual framework and ideas in the form of presentations, 1on1 guidance for open-ended conceptual explorations, availability for any and all questions regarding the project, PSCA specifications and literature/resource evaluations. 


Project Management

This is my first time leading a research project, and I have opted for a general project shape of divergence -> convergence. I put a lot of value on semi-independent and open-ended investigations of project-relevant questions, and I try to facilitate a research environment that supports such a style. I offer many 1on1 meetings, I question people about their approach and why they think it is useful, and I try to share ways of thinking that I consider fruitful, while hopefully not pressuring my team members into them. I have a clear vision of the components and shape of the project, but I am open as to which components will be strongly elaborated on by the team members.

I see the time at the AISC as a precursor for the long term ambition of the project, which is to facilitate a research cycle wherein conceptual/theoretical insights and questions inform the neurosymbolic architecture design and lead to clear predictions about the architecture’s behavior/performance, which are then empirically tested. The results of the tests inform the predictive theory, leading to either a revision of (parts of) the theory or the next iteration step in scaling the architecture and a new range of relevant predictions/questions. By iteratively developing an increasingly complex architecture in this way, I hope that we can study complex cognitive phenomena more clearly than has been previously achieved. The ambitious goal within the AISC is to bring the project to a point where most of the prerequisites are in place for this cycle to be started.

To this end, I encourage my team to study phenomena like preferences, abstractions and embedded agency with the intent to concretise and define them within a computational context, such that they can be implemented and/or tested with the PSCA. I have been giving presentations about my understanding of some conceptual fundamentals, covering topics like Computation, Intelligence and Agency. The goal is that the conceptual work will be grounded in first principles, further informed by interdisciplinary comparison, and added to a computational and predictive theory about designing and testing the PSCA.


During the first four weeks of the project, the concept of abstractions has emerged as a major discussion point and prerequisite concept for many questions of interest within the team. I was quite pleased with this development and the ensuing discussions, as well as the ability of the team members to simultaneously pursue other strands of research that went in different directions depending on the individual interest. It was not a problem to make sure that these research paths/questions remain sufficiently related to the project.

I do have to note, however, that the pace of our progress over the first month is only at about 75% of what I had been aiming for. I mainly attribute this to some unfortunate health issues on my end (abnormally high blood pressure) that limited my ability to frequently check in with everybody’s progress. I expect that the related issues during the next month will be minor in comparison, and that, with the team having settled into a more open-ended project structure, we can increase the pace to a satisfactory level.


On Competence and its relationship to Intelligence

Below is an essay within which I tried to concisely explain how I believe Competence and Intelligence to be related. It is meant as an elaboration on the ideas about Intelligence in the project proposal, but can be read without consulting it. As with most parts of this progress report, it is WIP.


I would measure/define Competence as the ability to achieve goals in an environment. General Competence denotes this for a wide range of environments, in order to make the notion of Competence less contextual to a specific environment. One could have comparative or absolute notions of (General) Competence. 

“Achieving goals in an environment” translates to achieving a subset of possible states of a given environment. We usually think of these states as not being the default ones, so that the agent has to take an active role in bringing about the subset.

Taking an active role in bringing about a subset of states of an environment (or a system) means that the agent needs to interface with the environment and introduce changes away from the “default causal trajectory”. This “interfacing” is done through actuation, meaning that the agent makes changes to the environment through the boundary of its embodiment. One could imagine this actuation interface for every possible state of the environment (including the agent), offering different access to the substates of the environment, depending on many factors, e.g. the agent’s location and height.

In principle, this actuation interface could be so complete that the agent has direct control over every subcomponent(/parameter) of the environment, making the issue of achieving goals so trivial that the agent does not even require perception. It could just copy-paste a desired state through the interface. This is a special case where no perception is required to be maximally competent with respect to the environment. 

Usually however, perception is required and can be seen as part of the interface on the boundary between agent and environment. It provides evidence about the current state of the environment. It can potentially inform the agent about which part of the environment can be altered through actuation and in what way. 

This is relevant for all non-trivial implementations of cognition. I conceive of cognition as the translation between perception and actuation, but with the optional addition of perception about the agent’s internal state and actuation on it - as part of the environment, the agent may also be interfacing with itself. The performance of the cognition is dependent on the “translation algorithm” and the computational substrate that it runs on. 

I like to think of “efficient translation software” as Understanding, and I think that, in the embedded agency setting, it generally cashes out as a composition of computational models that capture predictions about relevant sub-states of the environment and map them onto actions. 

Now, Intelligence is the algorithm/mechanism that updates and creates Understanding. It potentially makes the “translation software” more efficient, more effective, or adjusts it to new contexts (which I felt was worth mentioning separately, though it can be mapped onto increased efficiency/effectiveness).


Taking a step back, it should be apparent that multiple factors interplay to determine an agent’s Competence. If we lower the actuation capability of the agent, it potentially needs to find a cleverer way to achieve its goal, or maybe the goal has become unreachable in principle from the new interface. If we turn any of these dials of perception, actuation, cognitive hardware and cognitive software, or the way in which these all connect, we can alter the agent’s (General) Competence.
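As a rough illustration of these dials, here is a minimal Python sketch. It is a toy under loud assumptions, not part of the PSCA: the environment state is a single integer, a goal is a set of such states, the “default trajectory” is simply staying put, and the names `run_episode`, `general_competence`, `seeking`, `blind` and the `max_delta` actuation cap are all hypothetical.

```python
from typing import Callable, List, Set, Tuple

State = int
Goal = Set[State]
# A policy maps (perceived state, goal) to an action, here a change in state.
Policy = Callable[[State, Goal], int]


def run_episode(policy: Policy, start: State, goal: Goal,
                max_delta: int, steps: int = 20) -> bool:
    """Return True if the agent brings the environment into the goal subset."""
    state = start
    for _ in range(steps):
        if state in goal:
            return True
        # Actuation dial: the interface caps how far one action can move the state.
        action = max(-max_delta, min(max_delta, policy(state, goal)))
        state += action
    return state in goal


def general_competence(policy: Policy, tasks: List[Tuple[State, Goal]],
                       max_delta: int) -> float:
    """Fraction of (start state, goal) tasks that the policy solves."""
    solved = sum(run_episode(policy, start, goal, max_delta) for start, goal in tasks)
    return solved / len(tasks)


def seeking(state: State, goal: Goal) -> int:
    """Uses perception: moves towards the nearest goal state."""
    target = min(goal, key=lambda g: abs(g - state))
    return target - state


def blind(state: State, goal: Goal) -> int:
    """Ignores perception: always pushes the state up by one."""
    return 1


TASKS = [(0, {5}), (0, {-5}), (3, {3}), (10, {0})]
print(general_competence(seeking, TASKS, max_delta=1))  # all four tasks solved
print(general_competence(blind, TASKS, max_delta=1))    # only tasks reachable by +1
print(run_episode(seeking, 0, {50}, max_delta=1))       # goal out of reach in 20 steps
print(run_episode(seeking, 0, {50}, max_delta=50))      # wider interface: one action suffices
```

In this toy, the same translation algorithm (`seeking`) changes in Competence purely because the actuation interface changes, and the policy that ignores perception only succeeds where the goal happens to lie on its fixed path.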

Intelligence is the mechanism by which a functional translation algorithm is developed over time. If the hardware capacity suddenly doubles, one will need Intelligence to make use of that. The same holds for any other turn of the dial. If one eye goes blind, Intelligence will need to adjust the Understanding to account for that.

And yet, Intelligence can be superfluous if all these components are already correctly calibrated and connected, and the environment does not receive any unusual disturbance. In this case, Intelligence would not even register as a factor in the agent’s Competence. It is an indirect contributor.

An agent could self-improve in the absence of Intelligence, if its Understanding already encodes behavior that would lead to self-improvements. This is also why the boundary between Intelligence and Understanding is somewhat ambiguous. 

Intelligence is also just a sort of translation algorithm, but the environment that it actuates within is the cognition of the agent. And in this environment, it can be extremely competent - not least due to its potent interface.  


Anton Zheltoukhov (TC)

Focused on: Project/Team Coordination, bridging conceptual and implementation work, translating insights from other Cognitive Architectures and previous independent work to the framework of the PSCA. Currently working on project coordination and identifying sources with significant overlap to the project.

Why interpretable architectures

  • Project subgoals. Interpretability / Cognitive Architectures (CAs) 
    • To transfer insights from CAs to mainstream ML.
    • To research [belief / knowledge] update mechanisms of values and concepts
    • To eventually replace/complement mainstream ML solutions to achieve greater interpretability
    • To determine good language/vocabulary that covers/connects the pathways towards interpretability
    • To advertise cognitive architectures as a viable tool for AI safety and capacity
  • Project subgoals. positive attractors
    • introduce the notion of “positive attractors”, connect to previous work, supply examples
  • Pathways towards interpretability
    • Low level - bottom up: hardcore mechanistic interpretability. Dissect ML architectures and get to know how they tick
    • Higher level - starting from an alternative layer of resolution and going up/down from there: e.g. semiotic physics, which studies dynamics inside simulations
    • Highest level - top down: directly design and build inherently interpretable architectures. The problem is with scaling them
  • Old (symbolic) Cognitive Architectures failed for numerous reasons that were often unique to the CA, but were generally about scaling difficulties.
  • Why create a new CA? Previous ones are also interpretable.
    • Yes, existing CAs would potentially do, if updated
    • The main reasons for creating a new one
      • To focus on the specific parts of the dynamics that interest us (like the formation of abstractions and preference generators)
      • To start out with something more blank slate that can integrate new insights about cognition without any transported constraints
  • Requirement for scalability?
    • The main purpose of the project is not to work towards a competitive architecture
    • However, a certain level of competitiveness is required to show transferability of insights and to study interesting cognitive behavior
    • At the very least, the CA should not require much manual labor


Research Priorities

  • Expand on the idea of positive attractors. Describe possible dynamics leading to them. What entities are involved? How are they distributed between [CA algorithm / hardcoded rules / learned [behaviour / data] / training [env, data]]?
    • failure modes → invert them → %magic that relies on common sense% →
    • meta-preference example. Have a meta-preference to generally avoid reward hacking / specification gaming? 
  • List the most useful CAs for our purposes (inspiration/comparison). Ideally with an available codebase.
  • Specify possible experiments.
    • e.g. kill the dragon in Minecraft. Modern Minecraft is too complicated. A later version of the project could use some old simpler version for testing
    • The smallest experiment we can do right now. Say, I want to code something this weekend, what would it look like?
      • simplest prop room implementation + grid world + minimal number of objects + simple objective
      • Could be a labyrinth navigation task, perhaps three different labyrinths that have different rules, to test context capture
  • Settle down with personal research lines for each member of the team. Share them between team members
  • Two way approach. 
    • Forward part: from CAs to ML 
    • Backward part: from ML to CAs
  • Research model based RL
  • Conceptual work on agency, intelligence, learning
  • Overall shape of the CA part of the project. constraints. boundaries. fixed design decisions.
    • Embedded agency perspective for designing CA
    • Belief updates within the model should be Bayesian?
    • Decision making module should be symbolic
      • How is interpretability achieved? data model of said hypergraph. Log the decision making process? rule-based approaches?
        • Symbolic = interpretable
    • Blank slate vs initial knowledge. Some initial knowledge related to the experiment
      • We need to experiment with this
      • E.g. prespecified list of objects / content in the prop-room. abstraction hierarchy is given.
      • Blank slate option can be equivalent if the training runs are designed to elicit certain structures/content in the prop-room
    • Online learning. update of behavior defining data. online values evolution.
      • Learning can be turned off and on
      • Enabling lifelong-learning is not a priority of the project for now
    • ability to learn new embodiments
      • Also not a priority for now, but should be kept in mind for design decisions
    • Eventually aimed to scale close to human level, but that’s very out of scope for this current project
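The “smallest experiment” item in the list above (grid world + minimal number of objects + simple objective) could start from something like the following Python sketch. It only covers the environment side of a labyrinth task; it does not implement a prop-room, and the class name `GridWorld`, the character-based layout format and the move names are assumptions made for illustration.

```python
from typing import Sequence, Tuple

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


class GridWorld:
    """'#' is a wall, '.' is free, 'A' is the agent's start, 'G' is the goal.
    The layout is assumed to be fully enclosed by walls."""

    def __init__(self, layout: Sequence[str]) -> None:
        self.layout = layout
        self.agent = self._find("A")
        self.goal = self._find("G")

    def _find(self, symbol: str) -> Tuple[int, int]:
        for row_index, row in enumerate(self.layout):
            if symbol in row:
                return row_index, row.index(symbol)
        raise ValueError(f"layout has no {symbol!r} cell")

    def step(self, move: str) -> bool:
        """Apply one move; return True once the agent stands on the goal."""
        d_row, d_col = MOVES[move]
        row, col = self.agent[0] + d_row, self.agent[1] + d_col
        if self.layout[row][col] != "#":  # walls block movement
            self.agent = (row, col)
        return self.agent == self.goal


LABYRINTH = ["#####",
             "#A.G#",
             "#####"]
env = GridWorld(LABYRINTH)
print(env.step("up"), env.agent)     # blocked by the wall, agent stays put
print(env.step("right"), env.agent)
print(env.step("right"), env.agent)  # agent reaches the goal cell
```

Testing context capture, as suggested above, would then amount to swapping in several layouts (or several versions of `step` with different rules) and checking whether the agent adapts.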


Sohaib Imran

Focused on: Building a prototype for the Prop-room and Stage Cognitive Architecture (PSCA), as well as working out directly related concepts. Currently working on abstractions.

Note: * indicates areas where I have high uncertainty and feedback would be especially valuable.

Understanding abstractions

 johnswentworth: ‘Useful’ Information at a distance.

Often abstractions take the form of sets. For instance, the abstraction ‘fruit’ can be considered a set and so can the abstraction ‘red’. However, this seems to not always be the case. The abstraction ‘speed’, although applying to the set of moving things, is difficult to think of as a set itself*. The underlying reason may be that these are two different types of abstractions. The abstraction ‘speed’ is more akin to the abstraction ‘colour’ (than the abstraction ‘red’), where these are variables that can take multiple values. Mao et al. (2019) take a similar approach when they define ‘attributes’, such as shape and colour, as neural operators that map object embeddings to ‘concepts’, such as ‘circle’ and ‘red’, in an attribute-specific embedding space. However, there are some abstractions that will be both variables and values. For example, the abstraction ‘car’ can be a value of the variable ‘object’ as well as a variable with the value ‘Tesla Model 3’.


Modelling Abstractions

The project originally started with a model of abstractions as edges of a hypergraph. This ties in well with the understanding of abstractions as variables. Moreover, these hypergraph edges could themselves be nodes in other edges of the hypergraph, allowing abstractions to be reasoned about as both variables and values. However, hypergraph edges are only defined as a set of nodes and do not offer a method for interpolation, which is necessary when encountering never-before-seen objects. To solve this, another model that has been proposed is abstractions as regions of space.
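To make the hypergraph picture concrete, here is a minimal Python sketch. It assumes that a hyperedge is just a named set and that a member may itself be the name of another edge; the class `Hypergraph`, its methods and the example abstractions are hypothetical and not the project's actual data model.

```python
from typing import Dict, FrozenSet, Iterable, List


class Hypergraph:
    """Hyperedges are named sets; a member may itself be the name of another edge."""

    def __init__(self) -> None:
        self.edges: Dict[str, FrozenSet[str]] = {}

    def add_edge(self, name: str, members: Iterable[str]) -> None:
        self.edges[name] = frozenset(members)

    def contains(self, edge: str, item: str) -> bool:
        """True if item is in edge, either directly or through nested edges."""
        seen = set()
        stack = [edge]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            for member in self.edges.get(current, frozenset()):
                if member == item:
                    return True
                if member in self.edges:  # the member is itself an abstraction
                    stack.append(member)
        return False

    def edges_of(self, item: str) -> List[str]:
        """Names of all edges that list item as a direct member."""
        return sorted(name for name, members in self.edges.items() if item in members)


g = Hypergraph()
g.add_edge("red", {"apple", "fire_truck"})
g.add_edge("green", {"leaf"})
g.add_edge("colour", {"red", "green"})

print(g.edges_of("red"))              # 'red' is a value of 'colour' ...
print("red" in g.edges)               # ... and also a variable with its own members
print(g.contains("colour", "apple"))  # reached through the nested edge 'red'
print(g.contains("colour", "cherry")) # an unseen object belongs to no edge
```

The last call shows the limitation described above: pure set membership gives no way to place an object that has never been listed.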

The exact specifics of how different types of abstractions will be represented in space are still being discussed and refined. The current understanding is that variables are dimensions of a space and (discrete) values of those variables are clusters within those dimensions. Higher level abstractions that are composed from values of multiple variables are then clusters over multiple dimensions. (As clusters start to form over multiple dimensions, at some point the agent may decide to treat them as a value of a new variable, and therefore add a new dimension*.) This would require training the perception module to embed the current state in terms of the abstractions the agent has learnt.
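A minimal Python sketch of this spatial picture follows. It assumes that each cluster can be summarised by a single centre point and that a new observation is assigned to the nearest centre; the dimensions, centre coordinates and function names are invented for illustration, and the actual representation is still under discussion.

```python
from math import dist
from typing import Dict, Tuple

DIMENSIONS = ("hue", "size")  # variables are dimensions of the space

# Values of a variable are clusters along one dimension (hypothetical centres).
VALUES: Dict[str, Dict[str, float]] = {
    "hue": {"red": 0.0, "green": 0.33},
    "size": {"small": 0.2, "large": 0.8},
}

# Higher-level abstractions are clusters over several dimensions at once.
CONCEPTS: Dict[str, Tuple[float, float]] = {
    "cherry": (0.0, 0.2),
    "watermelon": (0.33, 0.8),
}


def describe(point: Tuple[float, float]) -> Dict[str, str]:
    """For each variable, pick the value whose cluster centre is closest."""
    return {
        variable: min(VALUES[variable], key=lambda v: abs(VALUES[variable][v] - x))
        for variable, x in zip(DIMENSIONS, point)
    }


def nearest_concept(point: Tuple[float, float]) -> str:
    """Pick the multi-dimensional cluster whose centre is closest to the point."""
    return min(CONCEPTS, key=lambda name: dist(CONCEPTS[name], point))


unseen_object = (0.05, 0.3)  # embedding of an object not encountered before
print(describe(unseen_object))
print(nearest_concept(unseen_object))
```

Unlike set membership, distance in the space gives an answer for objects that were never listed, which is the interpolation property motivating this model.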


Why abstractions?

Humans have learned to reason about the world in multiple levels of granularity, owing to the contrast between the limited information our brains can process at a time and the complexity of our environments. Therefore our world models are composed of lots of abstractions that we have learnt. Artificial agents, specifically embedded agents, will also encounter this difficulty due to being less complex than their environments, and will also form good abstractions as a useful instrumental goal. If we are to understand an agent’s world model, we would require its abstractions to be similar to ours*. This may naturally be the case if the Natural Abstraction Hypothesis (NAH) is true* (I don't really understand the NAH and have only just started reading about it). Considering the above, one reason for the focus on abstractions in this research is to better understand and test the NAH.


David Liu

Focused on: Conceptual Research on neurosymbolic Cognitive Architectures, with an emphasis on positive attractors. Currently working on understanding preferences generally.



A preference is something that is desired over other things. If an agent is given three possible choices {A,B,C} and chooses B, then B has some quantitative/qualitative value over A and C w.r.t. the agent, in the context of the given decision making scenario.

A more concrete example is the statement “I want a banana”, which expresses a preference for a banana.

The current agenda explores preferences under the reinforcement learning (RL) paradigm, where choices are made based on expected reward.

For the alignment problem, we want AIs to make good choices, and a major question is figuring out how this hypothetical decision-making process would work.



One way to guide decision-making is through meta-preferences. A meta-preference is a preference that is desired over other preferences. The statements “I want to want a banana” and “I want to have a desire to quit smoking” are examples of meta-preferences.

We can think of meta-preferences as strategies for picking or changing preferences (which means preferences can be thought of as strategies for making choices). If an agent wants to have a desire to quit smoking, they will place more value on preferences that express the idea “I desire to quit smoking”, which can be statements like “I don’t want cigarettes”, “I don’t want to go to the smoking lounge”, “I want to smoke less”, etc.
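As a toy illustration, the following Python sketch treats preferences as numeric values over options and a meta-preference as a function that adds a bonus to the preferences it favours. The additive `pressure` parameter, the option names and the numbers are assumptions made for this example; how an agent would actually represent either level remains open.

```python
from typing import Dict, Set


def choose(preferences: Dict[str, float]) -> str:
    """A preference acts as a strategy for making a choice: take the top-valued option."""
    return max(preferences, key=preferences.get)


def apply_meta_preference(preferences: Dict[str, float],
                          favoured: Set[str], pressure: float) -> Dict[str, float]:
    """A meta-preference acts as a strategy for changing preferences:
    it adds extra value to the options that express it."""
    return {
        option: value + (pressure if option in favoured else 0.0)
        for option, value in preferences.items()
    }


preferences = {"smoke": 1.0, "chew gum": 0.6, "go for a walk": 0.5}
quit_smoking = {"chew gum", "go for a walk"}  # options expressing "I desire to quit smoking"

print(choose(preferences))                                            # without the meta-preference
print(choose(apply_meta_preference(preferences, quit_smoking, 0.5)))  # strong pressure
print(choose(apply_meta_preference(preferences, quit_smoking, 0.2)))  # weak pressure
```

With weak pressure the original preference still wins, so the agent holds a meta-preference that does not change its choice.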

Meta-preferences place “extra pressure” towards certain preferences, but this claim generally applies to pure consequentialists, a form of reasoning that isn’t very human-like.

As an example, humans often hold inconsistent preferences, particularly in the context of chemical dependencies. A smoker could have a desire for cigarettes, yet retain a meta-preference where they don’t want such a desire. Preferences that offer high short-term rewards may appear very attractive, even when the reward is expected to diminish in the long run.

I doubt that future AIs will form chemical dependencies as humans do, but undesirable decision-making can take shape in other forms (e.g. reward hacking).

It’s also not entirely clear how AIs will model preferences and meta-preferences, or if meta-preferences will truly influence preference-making strategies—these problems remain to be explored.


Positive attractors

A negative attractor is a set of states that lead to failure modes (e.g. wireheading, perverse instantiation, deceptive behavior).

On the other hand, a positive attractor is a set of states that build resistance against failure modes. Positive attractors do not necessarily lead to “success modes”; rather, they behave more like threat-minimizers. Positive attractors are a vaccine for unaligned behaviors, rather than an antidote.

Positive attractors are worth studying because they may allow the AI to perform most of the heavy lifting for alignment (cognitive offloading).

Meta-preferences can be thought of as positive attractors. If an agent has the ability to reason with meta-preferences (i.e. forming broader strategies for decision-making), they may become less susceptible to tracking narrow, imprecise goals and choices.

Other possible types of positive attractors include self-reflection, the ability for an agent to assess their internal models, and imprinting, the ability for an agent to conduct phase learning.


Johnnie Pascalidis

Focused on: Conceptual Research from a neuroscience angle, questioning the project and trying to find gaps in our reasoning. Currently working on understanding abstractions.


Abstraction within Neuroscience


  • To better understand what abstraction is.
  • To better understand the architectural structure of the brain when dealing with abstractions.

A mini-taxonomy of different kinds of measurable abstractions:

  • Policy abstraction:
    • A more abstract rule describes which of a set of simpler rules to follow depending on particular circumstances (rules about rules).
    • REF: (Badre et al., 2010)
  • Temporal abstraction:
    • Higher temporal abstraction is when information that is used in a decision comes from longer ago.
    • REF: (Nee et al., 2014)
  • Relational abstraction
    • High relational abstraction would be processing the relations between objects, rather than the relation of an object to a specific feature metric (e.g., ‘is car A big?’- abstraction level 1; ‘is car A bigger than car B?’- abstraction level 2).
    • (unsure about how correct my understanding is)
    • REF: (Nee et al., 2014)
  • Domain generality
    • A brain region with more domain-general activation patterns is more abstract than one with more domain-specific activation patterns.


To integrate these, perhaps these abstractions all refer to the distance of the cognitive process from the immediate response to the environment.


My current understanding of why understanding abstraction is important:

Understanding this may mean we are better able to manipulate the highest levels of abstractions. Seeing this process as a decision-tree of increasing nested rules (i.e. rules about rules about rules…) means that decisions taken higher up the tree constrain the possible decision options at lower levels. The implications of this are: 

Firstly, it weakens the argument for needing to understand lower levels of abstraction: decisions made upstream mean that certain lower-level decisions are never considered, because a different route was chosen down the tree.

Secondly, higher levels of abstraction bound the possible options further down the tree. Therefore, it is important to understand the highest level of abstraction of the decision-tree. If this could be manipulated then this would allow maximum control over the possible decisions made at any point further down in the decision tree. This would therefore be the highest impact place for a positive attractor superordinate rule.
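The bounding effect can be sketched in a few lines of Python. The sketch assumes a strict top-down tree stored as nested dictionaries with lists of concrete actions at the leaves; the rule names, actions and the function `reachable_actions` are invented for illustration.

```python
from typing import Dict, List, Set, Union

# Inner nodes are rules about which lower-level rule to follow; leaves are actions.
Tree = Union[Dict[str, "Tree"], List[str]]

TREE: Tree = {
    "cooperate": {"share": ["give half", "give all"], "wait": ["do nothing"]},
    "compete": {"hoard": ["keep all"], "deceive": ["misreport"]},
}


def reachable_actions(node: Tree) -> Set[str]:
    """Collect every concrete action that can still be selected below this node."""
    if isinstance(node, list):
        return set(node)
    actions: Set[str] = set()
    for child in node.values():
        actions |= reachable_actions(child)
    return actions


print(sorted(reachable_actions(TREE)))               # nothing fixed: every action is possible
print(sorted(reachable_actions(TREE["cooperate"])))  # top-level rule fixed: options are bounded
```

Fixing the top-level rule removes whole branches at once, which is why that level looks like the highest-impact place to intervene, as long as lower levels cannot influence it in return.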


Some subsequent questions:

How would we go about extracting higher levels of abstraction?

How rigid would this decision-making architecture be (and therefore how scalable/ generalisable)?

How would it be manipulated even if it was understood?

(My understanding is that the agent would train from data, so this wouldn’t be directly specified).

How would we specify a positive attractor rule in a way that runs?

Does the brain process abstractions in a sequence of linear decisions?

(Badre & Nee, 2018)

  • Yes:
    • Mid dorso-lateral prefrontal cortex has been argued to be the apex of the abstraction gradient. This means that there may be an end/ top of the decision tree.
  • No:
    • Lower stages in the hierarchy can influence higher stages as well as the reverse. 
  • Implication if No:
    • If lower steps in the decision tree can influence higher steps, then by controlling the higher steps we do not necessarily bound the possible range of lower level decisions because the higher stages can themselves be influenced.
      • This bottom-up influence could simply not be implemented, but it might turn out to be necessary for performing flexible abstractions in a complex environment.


Current minor research strands

  • Interpretability:
    • What does it mean for something to be interpretable?
    • How do we measure interpretability?
    • What epistemological framework should/could be used for us to be able to explain in more detail what it means to understand something?
    • It could be also interesting to review how interpretable human decisions are
  • Neuroscience of values/ morality
  • Machine behaviour:
    • E.g. ‘Psychlab: A Psychology Laboratory for Deep Reinforcement Learning Agents’ https://arxiv.org/abs/1801.08116
      • Can perform psychological experiments in a virtual world on humans and artificial agents.
      • May be usable with cognitive architectures as well as with deep learning