Presumably some kinds of AI systems, architectures, methods, and ways of building complex systems out of ML models are safer or more alignable than others. Holding capabilities constant, you'd be happier to see some kinds of systems than others.

For example, Paul Christiano suggests "LM agents are an unusually safe way to build powerful AI systems." He says "My guess is that if you hold capability fixed and make a marginal move in the direction of (better LM agents) + (smaller LMs) then you will make the world safer. It straightforwardly decreases the risk of deceptive alignment, makes oversight easier, and decreases the potential advantages of optimizing on outcomes."

My quick list is below; I'm interested in object-level suggestions, meta observations, reading recommendations, etc. I'm particularly interested in design-properties rather than mere safety-desiderata, but safety-desiderata may inspire lower-level design-properties.

All else equal, it seems safer if an AI system:

  • Has humans in the loop (even better to the extent that they participate in or understand its decisions, rather than just approving inscrutable decisions);
  • Decomposes tasks into subtasks in comprehensible ways, and in particular if the interfaces between subagents performing subtasks are transparent and interpretable; 
  • Is more supervisable or amenable to AI oversight (what low-level properties determine this besides interpretable-ness and decomposing-tasks-comprehensibly?);
  • Is feedforward-y rather than recurrent-y (because recurrent-y systems have hidden states? so this is part of interpretability/overseeability?);
  • Is myopic;
  • Lacks situational awareness;
  • Lacks various dangerous capabilities (coding, weapon-building, human-modeling, planning);
  • Is more corrigible (what lower-level desirable properties determine corrigibility? what determines whether systems have those properties?) (note to self: see 1, 2, 3, 4, and comments on 5);
  • Is legible and process-based;
  • Is composed of separable narrow tools;
  • Can't be run on general-purpose hardware.

These properties overlap a lot. Also note that there are nice-properties at various levels of abstraction, like both "more interpretable" and [whatever low-level features make systems more interpretable].

If a path (like LM agents) or design feature is relatively safe, it would be good for labs to know that. An alternative framing for this question is: what should labs do to advance safer kinds of systems?

Obviously I'm mostly interested in properties that might not require much extra-cost and capabilities-sacrifice relative to unsafe systems. A method or path for safer AI is ~useless if it's far behind unsafe systems.

5 Answers

My improved (but still somewhat bad, pretty non-exhaustive, and in-progress) list:

  • Architecture/design
  • Data & training
    • [Avoid training on dangerous-capabilities-stuff and stuff likely to give situational awareness. Maybe avoid training on language model outputs.]
    • ?
  • Limiting unnecessary capabilities (depends on intended use)
    • The system is tool-y rather than agent-y
    • The system lacks situational awareness
    • The system is myopic
    • The system lacks various dangerous capabilities (e.g., coding, weapon-building, human-modeling, planning)
    • The system lacks access to various dangerous tools/plugins/actuators
    • Context and hidden states are erased when not needed
  • Interpretability
    • The system's reasoning is externalized in natural language (and that externalized reasoning is faithful)
    • The system doesn't have hidden states
    • The system uses natural language for all outputs (including interfacing between modules, chain-of-thought, scratchpad, etc.)
  • Oversight (overlaps with interpretability)
    • The system is monitored by another AI system
    • The system has humans in the loop (even better to the extent that they participate in or understand its decisions, rather than just approving inscrutable decisions) (in particular, consequential actions require human approval)
    • The system decomposes tasks into subtasks in comprehensible ways, and the interfaces between subagents performing subtasks are transparent and interpretable
    • The system is more supervisable or amenable to AI oversight
      • What low-level properties determine this besides interpretable-ness and decomposing-tasks-comprehensibly?
    • Humans review outputs, chain-of-thought/scratchpad/etc., and maybe inputs/context
  • Corrigibility
    • [The system doesn't create new agents/systems]
    • [Maybe satisficing/quantilization]
    • ?
  • Incident response
    • Model inputs and outputs are saved for review in case of an incident
    • [Maybe something about shutdown or kill switches; I don't know how that works]

(Some comments reference the original list, so rather than edit it I put my improved list here.)
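To make the incident-response item a bit more concrete, here is a minimal sketch of saving every model input and output for later review. It assumes the model can be treated as a stateless text-in/text-out function; `make_logged_model`, `toy_model`, and `audit_log` are hypothetical names for illustration, not any lab's actual setup.

```python
import json
import time
from typing import Callable, List


def make_logged_model(model: Callable[[str], str], log: List[str]) -> Callable[[str], str]:
    """Wrap a text-in/text-out model so every call is appended to a log."""

    def logged(prompt: str) -> str:
        output = model(prompt)
        # One JSON line per call, so reviewers can replay an incident later.
        record = {"time": time.time(), "input": prompt, "output": output}
        log.append(json.dumps(record))
        return output

    return logged


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model: it just upper-cases the prompt."""
    return prompt.upper()


audit_log: List[str] = []
model = make_logged_model(toy_model, audit_log)
model("hello")
model("second call")
```

Because the wrapper holds no state besides the log, each call starts from a clean context, which also loosely matches the "context is erased when not needed" item. A real deployment would presumably write to append-only storage rather than an in-memory list.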

More along these lines (e.g. sorts of things that might improve safety of a near-human-level assistant AI):

  • Architecture/design:
    • The system uses models trained with gradient descent as non-agentic pieces only, and combines them using classical AI.
    • Models trained with gradient descent are trained on closed-ended tasks only (e.g. next token prediction in a past dataset)
    • The system takes advantage of mild optimization.
      • Learning human reasoning patterns counts as mild optimization.
      • If recursion or amplification is used to improve results, we might want more formal m...

I thought I was going to have more ideas, but after sitting around a bit I'll just post what I had in terms of friendliness architectural/design choices that are disjoint from safety choices:

  • Moral and procedural self-reflection:
    • Human feedback / human in the loop
    • System should do active inference / actively help humans give good feedback.
    • Situational awareness to the extent that the situation at hand is relevant to self-reflection.
    • Bootstrapping:
      • The system should start out safe, with a broad knowledge about humans and ability to implement human feedback.
      • The "Broad knowledge yet safe" part definitely incentivizes language models being involved somewhere.
      • Implementing feedback, however, means more dangerous technology such as automated interpretability or a non-transformer architecture that's good at within-lifetime learning.
  • Moral pluralism / conservatism
    • The central idea is to avoid (or at least be averse to) parts of state space where the learned notions of value are worse abstractions / where different ways of generalizing human values wildly diverge.
    • Requires specifying goals not in terms of a value function (or similar), but in terms of a recipe for learning about the world that eventually spits out a value function (or similar).
    • Easier to do if the outer layer of the system is classical AI, though we probably shouldn't rule out options.
  • Dataset / Human-side design:
    • Broad sample of humanity is able to give feedback.
    • Transparency and accountability of the feedback process is important.
  • Capabilities useful for friendliness but problematic for safety:
    • Situational awareness
    • Independent planning and action when needed (some amount of consequentialist evaluation of soliciting human feedback/oversight).
    • Generality
    • Within-lifetime learning

Maybe relatively safe if:

  • Not too big
  • No self-improvement
  • No continual learning
  • Curated training data, no throwing everything into the cauldron
  • No access to raw data from the environment
  • Not curious or novelty-seeking
  • Not trying to maximize or minimize anything or push anything to the limit
  • Not capable enough for catastrophic misuse by humans
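One way to read "not trying to maximize anything or push anything to the limit" is quantilization, which an earlier answer mentions in passing. The sketch below is a simplified illustration, not a definitive implementation: it assumes `candidates` is already a sample from some trusted base distribution (e.g. actions a human might plausibly take), and it picks uniformly from the top `q` fraction by utility instead of taking the argmax. The function name and toy utility are hypothetical.

```python
import math
import random
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")


def quantilize(
    candidates: Sequence[A],
    utility: Callable[[A], float],
    q: float,
    rng: random.Random,
) -> A:
    """Pick uniformly at random from the top q fraction of candidates by utility."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(candidates, key=utility, reverse=True)
    top_k = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:top_k])


rng = random.Random(0)
candidates = list(range(100))  # toy "actions"; utility is just the number itself
picks = [quantilize(candidates, float, 0.25, rng) for _ in range(200)]
```

With `q = 1.0` this reduces to sampling from the base distribution, and as `q` shrinks toward zero it approaches plain maximization, so `q` is the knob for how hard the system optimizes.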

Someone anonymously suggests:

  • Keep pretraining closely matching the human text distribution
  • Require all rumination to be done in natural language
  • Require approval before taking consequential actions
  • Only deploy a model with a second adversarial monitor
  • Only deploy a model if it's "dumb" in a quantifiable way
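Two of these suggestions (approval before consequential actions, and a second monitor) can be combined into one gate. The following is a rough sketch under stated assumptions: actions come pre-labelled as consequential or not, the monitor and the human are plain callables, and all names (`Action`, `run_with_oversight`, the toy policies) are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Action:
    name: str
    consequential: bool  # assumed to be labelled upstream; labelling is the hard part


def run_with_oversight(
    actions: Iterable[Action],
    monitor_flags: Callable[[Action], bool],
    human_approves: Callable[[Action], bool],
    execute: Callable[[Action], None],
) -> List[str]:
    """Execute only actions that pass the monitor and, if consequential, a human."""
    executed: List[str] = []
    for action in actions:
        if monitor_flags(action):
            continue  # the second (adversarial) monitor blocks the action outright
        if action.consequential and not human_approves(action):
            continue  # consequential actions need explicit human sign-off
        execute(action)
        executed.append(action.name)
    return executed


# Toy policies standing in for a real monitor model and a real human reviewer.
proposed = [
    Action("read_docs", consequential=False),
    Action("send_email", consequential=True),
    Action("delete_backups", consequential=True),
    Action("exfiltrate_weights", consequential=False),
]
performed: List[str] = []
executed = run_with_oversight(
    proposed,
    monitor_flags=lambda a: "exfiltrate" in a.name,
    human_approves=lambda a: a.name == "send_email",
    execute=lambda a: performed.append(a.name),
)
```

Note that the monitor runs on every action, including ones labelled non-consequential, since a mislabelled action is exactly the case where the human gate would be skipped.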

Another factor for the safest type of AGI is one that can practically be built soon.

The perfect is the enemy of the good. A perfectly safe system that will be deployed five years after the first self-improving AGI is probably useless.

Of course the safest path is to never build an agentic AGI. But that seems unlikely.

This criterion is another argument for language model agents. I've outlined their list of safety advantages here.

Of course, we don't know if language model agents will achieve full AGI. 

Another path to AGI that seems both achievable and alignable is loosely brainlike AGI, along the lines of LeCun's proposed H-JEPA. Steve Byrnes' "plan for mediocre alignment" seems extensible to become quite a good plan for this type of AGI.  

14 comments

A note on this part:

If its true thoughts are transparent and expressed in natural language...

Claim: it is impossible for the internal thoughts of a mind to both be expressed mostly in natural language, and to use that natural language in a way at all similar to humans (for instance, no steganography). The reason is that, the way humans use natural language in practice, the words are largely used to "pull mental levers", and then most of the load-bearing reasoning happens in the non-linguistic cognition those levers induce. So if language is used the way humans use it, then most of the cognition is hidden away in non-linguistic channels.

You can see this in practice every time someone has trouble expressing their thoughts in words. Or every time someone is able to express their thoughts in words to someone who already has a bunch of shared mental models (even without necessarily shared jargon to point to those models - e.g. one can say "you know that thing where..." and give a couple examples), but is unable to express their thoughts in words to someone who doesn't already have the relevant mental models.

The closest unblocked requirement would be an AI with lots of parts separated by internal interfaces, where those interfaces use natural language in a human-like way. There's still a lot of non-natural-language cognition going on in between the natural language in and the natural language out, but the interfaces might still provide a lot of useful visibility of intermediates. Basically-all of today's LLMs would be examples of such an architecture: they take natural language in, do some magic, then spit out one more token of natural language.

General comment on this list: the mental image behind most of the items seems to be roughly "break the system into parts, which are individually interpretable and nonmalicious". Intermediates expressed in natural language, decomposition into interpretable subtasks, myopia, minimal hidden state, lack of situational awareness, legibility, process-based, and composition of narrow tools all tie into that general pattern.

These all share similar shortcomings: interpretability/tool-ness/corrigibility/etc are not composable, and also (depending on which versions of these ideas one imagines) the not-yet-well-written-up problem of "you don't get to choose the ontology".

That's not really a problem for any of these properties individually - they're each still potentially worthwhile properties which make AI relatively safer. But in aggregate, bear in mind that there are decreasing marginal returns to properties which all pursue roughly-similar upsides and have roughly-similar shortcomings. If this list is your starting point, then there's probably a lot more marginal value in adding properties which address e.g. the composability problem, rather than additional similar properties to those listed.

I think narrow domain might be one of the most important properties. An AI that is very good at designing drugs and nothing else is much safer than an AGI; it doesn't understand itself or the world at large and can't go on a self-improvement rampage. It still has risks but they're ordinary "don't let a radical political terrorist in a BSL4 lab" kinda risks.

Yo Shavit says:

I wish we talked more about which particular AI systems design principles give us confidence in safety.

For example: context is always erased; humans review ~all context and output tokens.

These principles are likely to disappear unless we center them in our analysis & demands.

Less worried rn about issues like "RLHF incentivizes deception"

Much more worried about "a pair of mostly-aligned AI systems talk to each other for 3 hours and then make POST requests, and no human actually reviews the full transcript"

I think it's good this post exists. But I really want to make the distinction between "safe" and "is a solution to the alignment problem," which this post elides. Or maybe "safe" vs. "friendly"?

If we build a superhuman AGI we'd better have solved the alignment problem in the sense of actually making that AGI want to do good things and not bad things. (Past a certain point, "just follow orders" isn't safe unless the order "do good things" works. If it wouldn't work, you've built an unsafe AI, and if it would work, you might as well give it.)

OpenAI's "parallelizable alignment assistant" strategy can work for between 0 and 4 organizations in the world, because it relies on having enough of a lead that you can build something that is safe yet not a solution to the alignment problem, and nobody else will cause an accident in the weeks or months you spend trying to convert this into a solution to the alignment problem.

To look at one example property: taking a random AI and putting a human in the loop makes it safer. But it does little to nothing for solving alignment. It helps when you're building an AI that's dumber than you, but doesn't really help when you're building an AI that's smarter than you.

Or lack of situational awareness. This is actively anti-alignment, because the state of the world is useful information for doing good things. But it's even more anti-capabilities, so it's a fine property to shoot for if you're making an AI that's safe because it has limited capabilities.

I'll come back later with a comment that actually makes suggestions, both ones that trade off for safety and for friendliness.

Well said. I mostly agree, but I'll note that safety-without-friendliness is good as a non-ultimate goal.

Re human in the loop, I mostly agree. Re situational awareness, I mostly agree and I'll add that lack-of-situational-awareness is sometimes a good way to deprive a system of capabilities not relevant to the task it's designed for; "capabilities" isn't monolithic.

I think my list is largely bad. I think central examples of good-ideas include LM agents and process-based systems. (Maybe because they're more fundamental / architecture-y? Maybe because they're more concrete?)

Looking forward to your future-comment-with-suggestions.

If its true thoughts are transparent and expressed in natural language (see e.g. Measuring Faithfulness in Chain-of-Thought Reasoning)

This seems technically true but a bit of a trap, since it may be easier to get ‘looks like it expresses its thoughts in natural language’ than ‘reliably actually does’ and specifying the difference may be too subtle for people. 

what lower-level desirable properties determine corrigibility?

Has corrigible (or alignment) properties embedded in the attention weights.

Here is a comparative analysis from a project in which I use datasets to instruct / hack / sanitize the whole attention mechanism of GPT2-xl: a spreadsheet comparing mean QKV weights across various GPT2-xl builds. The spreadsheet currently has four builds, and the numbers you see are the mean weights (half of the attention mechanism in layers 1 to 48, not including the embedding layer):

or the conceptual architecture of corrigibility is a part of the attention mechanism...

All else equal, I think minimizing model entropy is desirable (i.e. the number of weights). In other words, you want to keep the size of the model class small.

Roughly, alignment could be viewed as constructing a list of constraints or criteria that a model must satisfy in order to be considered safe. As the size of the model class grows, more models will satisfy any particular constraint. The complexity of the constraints likely needs to grow along with the complexity of the model class.

If a large number of models satisfy all the constraints, there is a large amount of behavior that is unconstrained and unaccounted for. We've decided that we don't care about any of the behavioral differences between the models that satisfy all the constraints.

This isn't necessarily true. Modern DL models are semi-organically grown rather than engineered, so the set of SGD discoverable models is much smaller than the set of all possible models. And techniques like iterative amplification further shrink the set of learnable models. Or maybe many of the models are behaviorally identical on the subset of inputs we care about.

That said, thinking about model entropy seems helpful.
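A toy way to see the counting argument: take the model class to be all Boolean functions on n input bits (so each truth-table entry plays the role of a free parameter), and let the "alignment constraints" pin down the output on a fixed number of inputs. This is only an illustrative sketch with made-up names, and it deliberately ignores the caveats above about which models training can actually reach.

```python
from itertools import product
from typing import Dict, Tuple

Bits = Tuple[int, ...]


def count_satisfying(n_inputs: int, constraints: Dict[Bits, int]) -> int:
    """Count Boolean functions on n_inputs bits that agree with every constraint."""
    inputs = list(product([0, 1], repeat=n_inputs))
    count = 0
    # Each candidate model is one full truth table over all possible inputs.
    for table in product([0, 1], repeat=len(inputs)):
        model = dict(zip(inputs, table))
        if all(model[x] == y for x, y in constraints.items()):
            count += 1
    return count


def constraints_for(n_inputs: int) -> Dict[Bits, int]:
    """Hypothetical constraint set: output 0 on all-zeros and 1 on all-ones."""
    return {(0,) * n_inputs: 0, (1,) * n_inputs: 1}


counts = {n: count_satisfying(n, constraints_for(n)) for n in (1, 2, 3)}
# counts == {1: 1, 2: 4, 3: 64}
```

With the number of constraints held at two, the number of models passing them goes 1, 4, 64 as the class grows, i.e. 2^(2^n - 2). Everything those surviving models disagree on is behavior the constraints say nothing about.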

Hmm, I haven't heard of this kind of thing before; what should I read to learn more?

I've not really seen it written up, but it's conceptually similar to the classic ML ideas of overfitting, over-parameterization, under-specification, and generalization. If you imagine your alignment constraints as a kind of training data for the model then those ideas fall into place nicely.

After some searching, the most relevant thing I've found is Section 9 (page 44) of Interpretable machine learning: Fundamental principles and 10 grand challenges. Larger model classes often have bigger Rashomon sets and different models in the same Rashomon set can behave very differently.

Why would we expect the expected level of danger from a model of a certain size to rise as the set of potential solutions grows?