I like this idea, but in practice it seems much better to instead use an obscure but otherwise modern language like Welsh. I think this gets you most of the benefits at substantially reduced cost.
One issue with this idea is that understanding what the model is doing via CoT or natural-language communication bottlenecks will become harder due to lossiness in translation. If the AIs do most of their reasoning in CoT, this consideration is probably dominant and we should just use English.
We could try to apply the obscure-language-only technique to just some applications, but since it affects all of pretraining, it might increase cost by a factor of 2.
What I really like about an ancient language is that there's no online community the model could exploit. Even low-resource modern languages have online forums an AI could use as an entry point.
But this consideration might be eclipsed by the fact that a rogue AI would have access to a translator before trying online manipulation, or by another scenario I'm not considering.
Agreed that the lack of direct access to CoT is one of the major drawbacks. Though we could have a slightly smarter reporter that could also answer questions about CoT interpretation.
This is a terrible idea because as we all know Sumerian is uniquely suited for hacking humans' language-interpretation facilities and taking control of their brains.
Upvoted because this made me smile the whole time, not because it actually seems particularly useful. Building a safe medical-advice dispenser isn't all that important if SOTA architecture is pumping out superintelligent agents. And if for some reason you've built a superintelligent agent and then tried to box it into giving you medical advice, and it escapes, I don't think suspicious amounts of Sumerian on the internet is your main concern, although this would be an absolutely killer plot for a novel.
I'm almost sure Sumerian had words for "artificial" and for "intelligence".
"Paracetamol" on the other hand... :-)
One could also imagine asking a group of Sumerian experts to craft new words for the occasion such that the updated language has enough flexibility to capture the content of modern datasets.
Why not go all the way and use a constructed language (like Lojban or Ithkuil) that's specifically designed for the purpose?
What do you estimate the time lag to be if AI startup "Ishtar" used your proposed safety method while every other AI startup it was competing with in the medical sector didn't?
It also seems to me that medical terminology would be extremely hard to translate into an ancient language in which those words never existed. The fine-tuning needed to correct for those errors would have to be factored into that time lag as well.
Would a bad actor be able to craft Persuasive Adversarial Prompts (PAP) in Sumerian and have the same impact as they did in this study?
"Most traditional AI safety research has approached AI models as machines and centered on algorithm-focused attacks developed by security experts. As large language models (LLMs) become increasingly common and competent, non-expert users can also impose risks during daily interactions. This paper introduces a new perspective to jailbreak LLMs as human-like communicators, to explore this overlooked intersection between everyday language interaction and AI safety. Specifically, we study how to persuade LLMs to jailbreak them. First, we propose a persuasion taxonomy derived from decades of social science research. Then, we apply the taxonomy to automatically generate interpretable persuasive adversarial prompts (PAP) to jailbreak LLMs. Results show that persuasion significantly increases the jailbreak performance across all risk categories: PAP consistently achieves an attack success rate of over 92% on Llama 2-7b Chat, GPT-3.5, and GPT-4 in 10 trials, surpassing recent algorithm-focused attacks. On the defense side, we explore various mechanisms against PAP, find a significant gap in existing defenses, and advocate for more fundamental mitigation for highly interactive LLMs."
Source: https://arxiv.org/abs/2401.06373
What would it look like if such a model produced code or, more generally, used any skill that involves a domain-specific language? I guess in the case of programming even keywords like "if" could be translated into Sumerian, but I can imagine tasks that you cannot obfuscate this way. For example, the model might do math by outputting only strings of mathematical notation.
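As a toy illustration of the keyword-translation idea (the `sux_*` tokens are invented placeholders, not actual Sumerian), a simple whole-word substitution pass could rewrite keywords while leaving operators and numerals untouched, which is exactly the limitation noted above:

```python
import re

# Toy sketch: rewriting Python keywords into placeholder tokens.
# "sux_if" etc. are stand-ins for hypothetical Sumerian equivalents.
KEYWORD_MAP = {"if": "sux_if", "else": "sux_else", "return": "sux_return", "def": "sux_def"}

def obfuscate(source: str) -> str:
    # \b ensures only whole keywords are replaced, not substrings of identifiers
    pattern = re.compile(r"\b(" + "|".join(KEYWORD_MAP) + r")\b")
    return pattern.sub(lambda m: KEYWORD_MAP[m.group(1)], source)

code = "def sign(x):\n    if x >= 0:\n        return 1\n    else:\n        return -1"
# Keywords become sux_def / sux_if / sux_else / sux_return,
# but ">=", "0", and "-1" pass through untouched.
print(obfuscate(code))
```

Note how the mathematical notation survives unchanged: a model emitting only symbols and numerals would leak no Sumerian at all, which is the concern raised above.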
Also, it seems likely that frontier models will all be multi-modal, so they will have other forms of communication that don't require language anyway. I suppose for image-generating models, you could train the frontier model to apply an obfuscating filter and have the reporter restore the image iff it is relevant to medical research.
Maybe my biggest concern is that I'm not convinced we will be safe from manipulation by restricting the model's communication to being only about, e.g., medical research. I don't have a precise story for how, but it seems plausible that a misaligned model could output instructions for medical research that also have side effects working towards the model's secret goal.
Nitpicks aside, I like the idea of adding a safety layer by preventing the model from communicating directly with humans! And my concerns feel like technical issues that could easily be worked out given enough resources.
It is good to see people thinking creatively, but a frontier model that becomes superhuman at physics and making plans that can survive determined human opposition is very dangerous even if it never learns how to read or understand any human language.
In other words, being able to interact verbally with humans is one avenue by which an AI can advance dangerous plans, but not the only avenue. (Breaking into computers would be another avenue where being able to communicate with humans might be helpful, but certainly not necessary.)
So, do you have any ideas on how to ensure that your Sumerian-schooled frontier model doesn't become superhuman at physics or at breaking into computers?
Say we want to train a scientist AI to help in a precise, narrow field of science (e.g. medicine design) but prevent its power from being applied anywhere else (e.g. chatting with humans, designing bio-weapons, etc.) even if it has these abilities.
Here’s one safety layer one could implement:
This safety layer helps in at least two ways:
Why Sumerian? [2]
How realistic is this? We think the project would require substantial engineering effort, of a scale doable by the current AGI companies. A small-scale project fine-tuned a T5 model on a corpus of ~100k Sumerian texts to translate them into English with reasonable quality. This is evidence that translation in the other direction is also doable. The resulting texts would probably not be fluent Sumerian, but good enough to accurately describe the huge diversity of subjects contained in traditional LLM datasets. If Sumerian resources turn out to be too scarce, companies could pick Latin or another ancient language, or even ask linguists to invent a language for the occasion.
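For concreteness, here is a minimal sketch of how parallel pairs might be prepared for T5-style seq2seq fine-tuning in the English-to-Sumerian direction. The task prefix, field names, and the placeholder transliteration are all assumptions for illustration, not details from the cited project:

```python
# Sketch: preparing parallel English -> Sumerian pairs for a T5-style
# seq2seq model. T5 conditions on a natural-language task prefix;
# the target is the raw translation.
def make_seq2seq_example(english: str, sumerian: str) -> dict:
    return {
        "input_text": "translate English to Sumerian: " + english,
        "target_text": sumerian,
    }

# "lugal-e e2 mu-un-du3" is a placeholder transliteration, not vetted Sumerian.
pairs = [("the king built a temple", "lugal-e e2 mu-un-du3")]
dataset = [make_seq2seq_example(en, sux) for en, sux in pairs]
print(dataset[0]["input_text"])
```

In practice these records would be tokenized and fed to something like Hugging Face's `Seq2SeqTrainer`; the hard part is assembling enough clean parallel data, not the training loop itself.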
What is this for? AI assistance seems important for many of the agendas currently pursued in top or upcoming labs (e.g. scalable oversight, alignment work by AI, creating a world simulation with AI expert programmers). Though there are cruxes for why none of these plans may work (e.g. that anything that can solve alignment is already too deadly), there is still dignity in the people who run these programs at least making strong efforts to safeguard those systems and limit their downside risk. It would be a sign of good faith if they actually engaged in highly effective boxing techniques (and all appropriate red-teaming) for their most powerful AI systems as they get closer to human-level AGI (and stopped before going beyond).
(Note that programs using low-resource languages, such as Native American languages, to obfuscate communications were successfully deployed at wide scale during the First and Second World Wars.)
Principles for designing safety layers
We’re not actually advocating for this exact solution as is. However, we think it’s good food for thought. It draws attention to three points that we think are important to keep in mind when designing AI safety interventions:
The end of the paragraph corresponds to lines 1-15 of the quote, according to this Reddit post.
Note that we are not Sumerian experts; we only spent ~2 hours researching Sumerian for this post. Feel free to point out our mistakes!