Let your every day be full of joy, love the child that holds your hand, let your wife delight in your embrace, for these alone are the concerns of humanity.[1]

— Epic of Gilgamesh - Tablet X

 

Say we want to train a scientist AI to help in a precise, narrow field of science (e.g. medicine design) but prevent its power from being applied anywhere else (e.g. chatting with humans, designing bio-weapons, etc.) even if it has these abilities.

Here’s one safety layer one could implement:

  1. Train a scientist AI on a large scientific corpus translated exclusively into Sumerian. Keep it in a secure containment environment.
  2. Train a less-smart reporter whose sole ability is to translate from Sumerian to English only if the Sumerian content is about medical research. It refuses to translate other kinds of content.
  3. Human operators are only allowed to interact with the scientist AI through the intermediary of the reporter.
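The output path of the three steps above can be sketched as a small gate. This is only an illustration under stated assumptions: `Reporter`, `is_medical`, `translate`, and the toy stand-ins below are hypothetical names, not components of any existing system, and a real reporter would be a trained model rather than a string check.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Reporter:
    """Gate between the contained scientist AI and the human operators."""

    is_medical: Callable[[str], bool]  # hypothetical topic classifier over Sumerian text
    translate: Callable[[str], str]    # hypothetical Sumerian -> English translator

    def relay(self, sumerian_text: str) -> Optional[str]:
        """Return the English translation, or None when the reporter refuses."""
        if not self.is_medical(sumerian_text):
            return None  # off-topic content never reaches the operator
        return self.translate(sumerian_text)


# Toy stand-ins so the sketch runs; real components would be learned models.
def toy_is_medical(text: str) -> bool:
    return text.startswith("[MED]")


def toy_translate(text: str) -> str:
    return "EN(" + text + ")"


reporter = Reporter(is_medical=toy_is_medical, translate=toy_translate)
allowed = reporter.relay("[MED] placeholder for a Sumerian research note")
refused = reporter.relay("[CHAT] placeholder for an off-topic message")
```

The design choice worth noting is that refusal returns nothing at all, so the operator never sees a partial translation of off-topic content.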

This safety layer helps in at least two ways:

  1. The scientist AI cannot directly manipulate humans to escape its containment. It would first have to trick the reporter in order to trick the humans, which leaves the attack possible but less practical.
  2. If the scientist AI exfiltrates, it’ll have a harder time manipulating humans online and will leave traces: e.g. any occurrence of Sumerian on the Internet outside of ancient text studies becomes highly suspicious. One can even imagine recognizing the idiosyncrasies of English coming from translated Sumerian, or the artifacts left by the Sumerian-specific tokenizer reused to produce non-Sumerian text.
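As a rough illustration of the "leaves traces" point, here is a minimal sketch of a monitor that flags text containing cuneiform script. It assumes the leaked text is written in Unicode cuneiform; Sumerian transliterated into Latin characters would need a different detector. The function names, the 0.2 threshold, and the `allowlisted` flag (standing in for known ancient-text-studies sources) are all hypothetical.

```python
# Unicode blocks covering cuneiform script.
CUNEIFORM_RANGES = (
    (0x12000, 0x123FF),  # Cuneiform
    (0x12400, 0x1247F),  # Cuneiform Numbers and Punctuation
    (0x12480, 0x1254F),  # Early Dynastic Cuneiform
)


def cuneiform_fraction(text: str) -> float:
    """Fraction of non-whitespace characters that fall in a cuneiform block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(
        1 for c in chars
        if any(lo <= ord(c) <= hi for lo, hi in CUNEIFORM_RANGES)
    )
    return hits / len(chars)


def is_suspicious(text: str, threshold: float = 0.2, allowlisted: bool = False) -> bool:
    """Flag text with a high share of cuneiform unless its source is allowlisted."""
    return (not allowlisted) and cuneiform_fraction(text) >= threshold


sample = "\U00012000\U00012001 ab"  # two cuneiform signs followed by two Latin letters
```

For example, `is_suspicious(sample)` flags the sample string, while the same text from an allowlisted scholarly source would pass.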

Why Sumerian[2]

  • It has been extinct for 4000 years. There are very few (probably <1000) experts who fluently understand Sumerian.
  • It is a language isolate. It’s unlike any existing spoken language, rendering its identification in case of a leak much easier.
  • There is a substantial corpus. Despite its age, a significant number of Sumerian texts have been discovered and preserved. These include religious texts, legal codes, literature (like the Epic of Gilgamesh, parts of which are written in Sumerian), and administrative records. The corpus might be enough to train high-quality systems that translate from English and other high-resource languages.

How realistic is this? We think the project would require substantial engineering effort, on a scale feasible for the current AGI companies. A small-scale project fine-tuned a T5 model to translate 100k Sumerian to English with reasonable quality. This is evidence that translation in the other direction is doable. The resulting texts will probably not be fluent Sumerian, but they should be good enough to accurately describe the huge diversity of subjects contained in traditional LLM datasets. Even if there are too few Sumerian resources, companies could pick Latin or another ancient language, or even ask linguists to invent a language for the occasion.

What is this for? AI assistance seems important for many of the currently pursued agendas in top labs or upcoming labs (e.g. scalable oversight, alignment work by AI, creating a world simulation with AI expert programmers). Though there are cruxes for why none of these plans may work (e.g. that anything that can solve alignment is already too deadly), it’s still dignity that people who run these programs at least make strong efforts at safeguarding those systems and limit their downside risk. It would be a sign of good faith that they actually engage in highly effective boxing techniques (and all appropriate red teaming) for their most powerful AI systems as they get closer to human-level AGI (and stop before going beyond).

(Note that programs using low-resource languages, such as Native American languages, to obfuscate communication were used successfully and at wide scale during the First and Second World Wars.)

Principles for designing safety layers

We’re not actually vouching for this exact solution as is. However, we think it’s good food for thought. It draws attention to 4 points that we think are important to keep in mind when designing AI safety interventions:

  1. We have control over the training process. We create these machines. Even if it’s unclear how to make the best use of these advantages, this is a crucial difference from the scenario “hostile super-intelligent aliens are arriving on Earth” that AI safety sometimes turns into in our minds.
  2. Society is more robust to certain capacities than others. It is often advised that we should work on making society more resilient. This example shows we can take a more fine-grained approach: society is not equally resistant to every capacity. We can use the control we have over the training process such that AIs have capacities society is less fragile to — in this case removing the AI’s ability to use a widely known language as an API to interact with humans, hence diminishing drastically the manipulation attack surface. This could be seen as an application of the least privilege principle: only give the capabilities the AI needs to do its job.
  3. It’s sometimes useful to go from “solving alignment” to “finding a new cheese slice to stack”. Working on AI safety has often been synonymous with a glorious effort to put an end to the alignment problem, in the same way that humanity put an end to Fermat’s Last Theorem. We think this framing can trap researchers into overlooking the complexity of the socio-technical problem that is AI safety. Instead, we think that aiming for gradual safety improvements until risk falls under an acceptable threshold is more fruitful for coming up with solutions that can be applied in the real world (for controlling risk from human-level AGI, not superintelligence).
  4. Many such ideas already exist, and many more can be invented. It matters to diversify the approaches we actually develop. We encourage readers to take a moment to generate a few such ideas, and to consider working on these neglected ideas and building proofs of concept so they may more easily be integrated into plans.

 

  1. ^

     The end of the paragraph 1-15 corresponds to the quote according to this reddit post.

  2. ^

     Note that we are not Sumerian experts; we only spent ~2h researching Sumerian for this post. Feel free to point out our mistakes!

14 comments

I like this idea, but it seems much better in practice to instead use some obscure but otherwise modern language like Welsh. I think this gets you most of the benefits at substantially reduced cost.

One issue with this idea is that understanding what the model is doing via CoT or natural language communication bottlenecks will become worse due to lossiness in translation. I think if the AIs do most of their reasoning in CoT, this consideration is probably dominant and we should just use English.

We could try to apply the obscure-language-only technique to only some applications, but this affects all of pretraining, so it might increase cost by a factor of 2.

What I really like about ancient languages is that there's no online community the model could exploit. Even low-resource modern languages have online forums an AI could use as an entry point.

But this consideration might be eclipsed by the fact that a rogue AI would have access to a translator before trying online manipulation, or by another scenario I'm not considering.

Agreed that the lack of direct access to CoT is one of the major drawbacks. Though we could have a slightly smarter reporter that could also answer questions about CoT interpretation.

This is a terrible idea because as we all know Sumerian is uniquely suited for hacking humans' language-interpretation facilities and taking control of their brains.

Upvoted because this made me smile the whole time, not because it actually seems particularly useful. Building a safe medical-advice dispenser isn't all that important if SOTA architecture is pumping out superintelligent agents. And if for some reason you've built a superintelligent agent and then tried to box it into giving you medical advice, and it escapes, I don't think suspicious amounts of Sumerian on the internet is your main concern, although this would be an absolutely killer plot for a novel.

How do you translate modern terms like AI into Sumerian?

I'm almost sure Sumerian had words for "artificial" and for "intelligence".

"Paracetamol" on the other hand... :-)

One could also imagine asking a group of Sumerian experts to craft new words for the occasion such that the updated language has enough flexibility to capture the content of modern datasets.

Why not go all the way and use a constructed language (like Lojban or Ithkuil) that's specifically designed for the purpose?

Or even better, Quenya or Þindarin.

What do you estimate the time lag to be if AI startup "Ishtar" utilized your proposed safety method while every other AI startup that Ishtar was competing with in the medical sector didn't?

It also seems to me like medical terminology would be extremely hard to translate into an ancient language where those words weren't known. The fine-tuning to correct for those errors would have to be calculated into that time lag as well. 

Would a bad actor be able to craft Persuasive Adversarial Prompts (PAP) in Sumerian and have the same impact as they did in this study?

"Most traditional AI safety research has approached AI models as machines and centered on algorithm-focused attacks developed by security experts. As large language models (LLMs) become increasingly common and competent, non-expert users can also impose risks during daily interactions. This paper introduces a new perspective to jailbreak LLMs as human-like communicators, to explore this overlooked intersection between everyday language interaction and AI safety. Specifically, we study how to persuade LLMs to jailbreak them. First, we propose a persuasion taxonomy derived from decades of social science research. Then, we apply the taxonomy to automatically generate interpretable persuasive adversarial prompts (PAP) to jailbreak LLMs. Results show that persuasion significantly increases the jailbreak performance across all risk categories: PAP consistently achieves an attack success rate of over 92% on Llama 2-7b Chat, GPT-3.5, and GPT-4 in 10 trials, surpassing recent algorithm-focused attacks. On the defense side, we explore various mechanisms against PAP and, found a significant gap in existing defenses, and advocate for more fundamental mitigation for highly interactive LLMs."

Source: https://arxiv.org/abs/2401.06373

What would it look like if such a model produces code or, more generally, uses any skill that entails using a domain-specific language? I guess in the case of programming even keywords like "if" could be translated into Sumerian, but I can imagine that there are tasks that you cannot obfuscate this way. For example, the model might do math by outputting only strings of mathematical notation.

Also, it seems likely that frontier models will all be multi-modal, so they will have other forms of communication that don't require language anyway. I suppose for image generating models, you could train the frontier model to apply an obfuscating filter and have the reporter restore the image iff it is relevant to medical research. 

Maybe my biggest concern is that I'm not convinced that we will be safe from manipulation by restricting the model's communication to being only about e.g. medical research. I don't have a precise story for how, but it seems plausible that a misaligned model could output instructions for medical research that also have side effects that work towards the model's secret goal.

Nitpicks aside, I like the idea of adding a safety layer by preventing the model from communicating directly with humans! And my concerns feel like technical issues that could easily be worked out given enough resources.

Did you get this idea from stories about ancient Mesopotamian demons (like Pazuzu from The Exorcist)?

No, we were just brainstorming internship projects from first principles :)

It is good to see people thinking creatively, but a frontier model that becomes superhuman at physics and making plans that can survive determined human opposition is very dangerous even if it never learns how to read or understand any human language.

In other words, being able to interact verbally with humans is one avenue by which an AI can advance dangerous plans, but not the only avenue. (Breaking into computers would be another avenue where being able to communicate with humans might be helpful, but certainly not necessary.)

So, do you have any ideas on how to ensure that your Sumerian-schooled frontier model doesn't become superhuman at physics or breaking into computers?