muhia_bee

Wiki Contributions

Comments

Answer Set Programming for Automated Verification of Intent Consistency

Brian Muhia, August 2023

The causal influence diagrams introduced here (see the appendix, also here), and the accompanied reasoning that favours certain diagrams over others based on links to the "I" node, are simple enough that we can devise automated rules that check if a diagram is correct or wrong. We call this property "intent consistency". Here we introduce three simple rules written in the Answer-Set Programming (ASP) formalism that

  1. find paths between any two nodes X and Y, then
  2. check if a path exists from the input node "I" to any decision node, and
  3. fails if there is no direct link from the node "I" to any decision node.

These rules encode our expectations and intuitions, and let us describe a framework for automatically deciding if a diagram satisfies them. We call these rules "intent consistency models" (ICM), after [https://doi.org/10.1017/S1471068410000554].

% ...after encoding a CID using ASP facts link(), decision() and utility() (read linked Google Doc),
% Recursive rule to find a path from X to Y
path(X, Y) :- link(X, Y).
path(X, Y) :- link(X, Z), path(Z, Y).

% Rule that fails if there is no direct link from I or C to any of the decision nodes.
direct_link(I, D) :- link(I, D), decision(D).
:- decision(D), not direct_link("I", D).
:- decision(D), not direct_link("C", D).

% Checks if a path exists from I to any decision node
check(Node) :- decision(Node), path("I", Node).

#show direct_link/2.
#show check/1.

ASP enables us to encode the graphs described here using facts, and run them through a conventional SAT solver like 'clingo' that checks for satisfiability. We can then describe unsatisfiable graphs as "incorrect".

When chaining parallel and sequential calls to large language models (like LangChain), you implicitly create a causal graph that can be analyzed visually if you have the right tracing tools (https://github.com/oughtinc/ice). This notebook describes different agents using an explicit formalism based on causal influence diagrams, which we can treat as a notation for describing the data flow, components and steps involved when a user makes a request. We use the example diagrams to explain and fix risk scenarios, showing how easy it is to debug agent architectures if you can visually reason about the data flow, and ask questions about intent alignment for AGI in the context of such agents.

Examples and Theory in Colab to Get Started:

Work done at the Alignment Jam #8 (Verification), starts at 31:43 but the whole event was great: https://youtu.be/XauqlTQm-o4

Paper: https://docs.google.com/document/d/160Yw_iuvztB6CTT9Osj5wC0sOrEKjfaGkkeeYuwQf4Y/edit#heading=h.kwtox8r6b7n6

https://colab.research.google.com/drive/1roLQgXhEtI83Q5vX1q24Q9iDgu5LFFWA#scrollTo=5b8fbbeb-e90b-4990-ac3e-d484205b78aa

TODO:

  • Mechanistic Interpretability: Info-Weighted Attention mechanisms, Info-weighted Averaging (https://youtu.be/etFCaFvt2Ks)
  • [viz] Animating the temporal dependence if we have timestamps of each sub-agent process starting - should add this to tracing code
  • [Theory] Study links to Garrabrant's Temporal Inference with Finite Factored Sets: https://arxiv.org/abs/2109.11513