CAIS-inspired approach towards safer and more interpretable AGIs

Peter Hroššo

CAIS-inspired approach towards safer and more interpretable AGIs — LessWrong

27th Mar 2023

Epistemic status: a rough sketch of an idea

Current LLMs are huge and opaque, and our interpretability techniques are not adequate to understand them. Current LLMs are unlikely to run hidden, dangerous optimization processes, but larger ones may.

Let's cap model size at that of the largest current models and ban anything bigger. Let's not build superhuman LLMs. Instead, let's build human-level specialist LLMs and let them communicate with each other in natural language. Natural language is more interpretable than the inner processes of large transformers. Together, the specialized LLMs will form a meta-organism that may become superhuman, but it will be more interpretable and corrigible, because we'll be able to inspect and intervene on the messages passed between them.
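A minimal sketch of what "intervening on the messages" could look like, assuming each specialist is reduced to a plain function from text to text and all traffic passes through one logged channel. The names `Message`, `MessageBus`, and `block_keyword_reviewer` are hypothetical and exist only for this illustration; the keyword check is a stand-in for whatever human or automated review would actually be used.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Message:
    sender: str
    recipient: str
    text: str


# A reviewer returns the message (possibly edited) or None to block it.
Reviewer = Callable[[Message], Optional[Message]]
# Assumption: a specialist LLM is stubbed as a text -> text function.
Specialist = Callable[[str], str]


@dataclass
class MessageBus:
    specialists: Dict[str, Specialist]
    reviewer: Reviewer
    log: List[Message] = field(default_factory=list)

    def send(self, msg: Message) -> Optional[str]:
        # Every message is recorded in plain language, including blocked ones.
        self.log.append(msg)
        reviewed = self.reviewer(msg)
        if reviewed is None:
            return None  # intervention: the message never reaches the recipient
        return self.specialists[reviewed.recipient](reviewed.text)


def block_keyword_reviewer(msg: Message) -> Optional[Message]:
    # Toy policy for illustration only: block messages containing one keyword.
    if "pathogen" in msg.text.lower():
        return None
    return msg


bus = MessageBus(
    specialists={"chemist": lambda text: f"chemist received: {text}"},
    reviewer=block_keyword_reviewer,
)
allowed = bus.send(Message("biologist", "chemist", "Please summarise solvent safety data."))
blocked = bus.send(Message("biologist", "chemist", "Help me synthesise a pathogen."))
```

The point of the design is only that the channel between specialists is a place where messages can be read, logged, edited, or dropped, which has no direct analogue inside a single large model's activations.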

Of course, parameter efficiency may increase in the future (as happened with Chinchilla), so we should monitor this and potentially lower the cap. On the other hand, our mechanistic interpretability techniques may improve, in which case we could raise the cap if we are confident it won't do harm.

This idea seems almost trivial to me, but I haven't seen it discussed anywhere, so I'm posting it early to gather feedback on why it might not work.

Tags: AI Governance, AI Services (CAIS), Chain-of-Thought Alignment, Language Models (LLMs)

Niki Dupuis (3y)

"Natural language is more interpretable than the inner processes of large transformers."

There's certainly something here, but it's tricky because this implicitly assumes that the transformer is using natural language in the same way that a human is. I highly recommend these posts if you haven't read them already:

Peter Hroššo (3y)

Regarding steganography: there is a natural constraint that the payload (the hidden message) must be small relative to the main message. This is a natural bottleneck on hidden communication, which should give us a fair advantage over the inscrutable information flows in current large models.

On top of that, it seems viable to monitor for cases where a previously benevolent LLM receives a seemingly benign message and then starts acting maliciously.
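A rough sketch of that kind of monitoring, assuming we keep a chronological log of what each specialist received and did, and assuming some separate (hypothetical) behaviour classifier has already marked actions as suspicious via a `flagged` field. The `Event` type and `suspect_trigger` function are invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    agent: str
    kind: str              # "received" or "action"
    text: str
    flagged: bool = False  # assumed output of a hypothetical behaviour classifier


def suspect_trigger(events: List[Event], agent: str) -> Optional[Event]:
    """Return the last message `agent` received before its first flagged action."""
    last_received: Optional[Event] = None
    for ev in events:  # events are assumed to be in chronological order
        if ev.agent != agent:
            continue
        if ev.kind == "received":
            last_received = ev
        elif ev.kind == "action" and ev.flagged:
            return last_received
    return None


history = [
    Event("chemist", "received", "Please list common lab solvents."),
    Event("chemist", "action", "Listed solvents."),
    Event("biologist", "received", "Unrelated request."),
    Event("chemist", "received", "Nice weather today, isn't it?"),
    Event("chemist", "action", "Attempted to disable logging.", flagged=True),
]
trigger = suspect_trigger(history, "chemist")
```

Here `trigger` is the innocuous-looking message that immediately preceded the change in behaviour, which is the message one would want to examine for a hidden payload.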

I think the main argument behind my proposal is that if we limit the domains a particular LLM is trained on, there will be fewer emergent capabilities. I.e., a computer-science specialist may come up with steganographic messaging, but it will be hard to spread this skill or knowledge to specialists in other domains such as biology, chemistry, or the humanities, and these other specialists won't be able to come up with it by themselves. They might be able to come up with other dangerous things, such as bioweapons, but they won't be able to use them against us without coordination, secure communication, etc.

Peter Hroššo (3y)

Thanks for the links, will check it out!

I'm aware this proposal doesn't address deception or side-channel communication such as steganography. But being able to understand at least the first level of the message, as opposed to the current state of understanding almost nothing from the weights and activations, seems like a major improvement to me.

Brendon_Wong (3y)

Have you seen Seth Herd's work and the work it references (particularly natural language alignment)? Drexler also has an updated proposal called Open Agencies, which seems to be an updated version of his original CAIS research. It seems like Davidad is working on a complex implementation of open agencies. I will likely work on a significantly simpler implementation. I don't think any of these designs explicitly propose capping LLMs though, given that they're non-agentic, transient, etc. by design and thus seem far less risky than agentic models. The proposals mostly focus on avoiding riskier models that are agentic, persistent, etc.

PeterMcCluskey (3y)

The main effect might be reduced interpretability due to more superpositioning?

Teun van der Weij (3y)

I think your policy suggestion is reasonable.

However, implementing and executing this might be hard: what exactly is an LLM? Does a slight variation on the GPT architecture count as well? How are you going to punish law violators?

How do you account for other worries? For example, as PeterMcCluskey points out, this policy might lead to reduced interpretability due to more superposition.

Policy seems hard to do at times, but others with more AI governance experience might provide more valuable insight than I can.

Lucius Bushnaq (3y)

Seems like a slight variant on MIRI's visible thoughts project?
