Interlingua-llm

by Никифор Малков
30th Aug 2025


Exploring artificial compressed languages to improve efficiency, context usage, and cross-lingual unification in LLMs

Artificial Languages for Efficient Training of Large Language Models

Abstract

Large Language Models (LLMs) are typically trained on heterogeneous natural language corpora. This introduces redundancy and noise and forces the model to allocate parameters to handling multiple linguistic structures. We propose using artificial intermediate languages (AILs) as a compressed, noise-free representation for training. Instead of mapping raw natural language directly into embeddings, a preprocessing stage could translate natural text into a canonical artificial language. This may reduce entropy, improve generalization, and lead to more efficient parameter utilization.

Motivation

Training multilingual models incurs substantial computational overhead. Each natural language brings idiosyncratic grammar, orthography, and irregularities. LLMs must allocate capacity to disambiguate and represent these differences. An artificial language could provide:

  • Compression: Removing redundant morphology and syntax.
  • Normalization: Standardizing meaning representation across languages.
  • Noise Reduction: Excluding low-quality or inconsistent samples.

Method (Proposed)

  • Artificial Language Design: Create a symbolic or token-based canonical language optimized for LLM consumption.
  • Preprocessing Pipeline: Translate input natural languages into the AIL before feeding them to the model.
  • Model Training: Train on the AIL directly, possibly with fewer layers or smaller embedding sizes.
  • Postprocessing/Decoding: Map model outputs back into natural languages for human interpretability (a toy end-to-end sketch of the pipeline follows this list).
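As a concrete illustration, here is a minimal Python sketch of the four stages above. Everything in it is hypothetical: the predicate-argument notation stands in for whatever AIL design is eventually chosen, the lookup tables stand in for a real natural-language-to-AIL translator, and no particular training framework is assumed.

```python
from dataclasses import dataclass


# Stage 1: artificial language design.
# Here the AIL is a tiny predicate-argument notation, PRED(ARG1, ARG2),
# standing in for whatever canonical symbolic language is actually designed.
@dataclass(frozen=True)
class AILFrame:
    predicate: str
    agent: str
    patient: str

    def to_tokens(self) -> list[str]:
        # Canonical, morphology-free token sequence to be fed to the model.
        return [self.predicate, "(", self.agent, ",", self.patient, ")"]


# Stage 2: preprocessing (natural language -> AIL).
# Toy lookup table; two different surface forms collapse onto one AIL frame.
NL_TO_AIL = {
    "the cat chased the mouse": AILFrame("CHASE", "CAT", "MOUSE"),
    "a mouse was chased by the cat": AILFrame("CHASE", "CAT", "MOUSE"),
}


def preprocess(sentence: str) -> list[str]:
    """Translate a natural-language sentence into AIL tokens."""
    return NL_TO_AIL[sentence.lower().strip()].to_tokens()


# Stage 3: training would happen directly on AIL token streams, e.g.
#   model.train(corpus=[preprocess(s) for s in natural_corpus])
# (comment only; no specific training framework is implied).


# Stage 4: postprocessing (AIL -> natural language) for human readability.
AIL_TO_NL = {
    ("CHASE", "CAT", "MOUSE"): "The cat chased the mouse.",
}


def decode(tokens: list[str]) -> str:
    """Map an AIL token sequence back to a natural-language sentence."""
    predicate, agent, patient = tokens[0], tokens[2], tokens[4]
    return AIL_TO_NL[(predicate, agent, patient)]


if __name__ == "__main__":
    ail = preprocess("A mouse was chased by the cat")
    print(ail)          # ['CHASE', '(', 'CAT', ',', 'MOUSE', ')']
    print(decode(ail))  # The cat chased the mouse.
```

Note how two different surface forms collapse onto the same six AIL tokens; this is the compression and normalization effect the proposal relies on, and building that mapping robustly in both directions is the open problem raised in the Discussion.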

Discussion

  • AIL could function analogously to Intermediate Representations (IRs) in compilers.
  • This approach may decouple representation learning from communication in natural languages.
  • Forward-pass cost would not fall through fewer layers, but the lower entropy of AIL input could reduce the required hidden size (a rough estimate follows this list).
  • Potential drawbacks include the complexity of building robust bidirectional AIL-to-natural language mappers.
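To make the hidden-size point concrete, here is a back-of-the-envelope estimate. It uses the standard rough approximations that a decoder-only transformer has about 12·L·d² non-embedding parameters and that training costs about 6·(parameters)·(tokens) FLOPs; the 32-layer, 4096-dimension baseline, the 25% hidden-size reduction, and the token count are purely illustrative assumptions, not results.

```python
# Rough estimate of the claimed savings; all concrete numbers are illustrative.
# Standard approximations: non-embedding transformer parameters ~ 12 * L * d^2,
# total training compute ~ 6 * params * tokens FLOPs.

def transformer_params(n_layers: int, d_model: int) -> float:
    """Approximate non-embedding parameter count of a decoder-only transformer."""
    return 12 * n_layers * d_model ** 2


def training_flops(params: float, tokens: float) -> float:
    """Approximate total training compute (forward + backward passes)."""
    return 6 * params * tokens


n_layers = 32          # depth is held fixed, as the post assumes
tokens = 1e12          # hypothetical number of training tokens

baseline = transformer_params(n_layers, d_model=4096)
# Hypothetical: suppose lower-entropy AIL input allows a 25% smaller hidden size.
compressed = transformer_params(n_layers, d_model=3072)

print(f"baseline params:   {baseline / 1e9:.1f}B")        # ~6.4B
print(f"compressed params: {compressed / 1e9:.1f}B")      # ~3.6B
print(f"parameter ratio:   {compressed / baseline:.2f}")  # 0.56
print(f"training-FLOP ratio: "
      f"{training_flops(compressed, tokens) / training_flops(baseline, tokens):.2f}")  # 0.56
```

Because parameters and per-token compute scale with d², even a modest reduction in hidden size gives a sizeable saving at fixed depth; any shortening of the corpus itself (fewer AIL tokens for the same content) would multiply on top of this.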

Conclusion

Artificial languages could serve as a powerful abstraction for LLM training. While forward-pass depth (the number of layers) would remain constant, the required hidden dimension and noise-related redundancy could decrease. This may yield significant GPU efficiency gains, enabling smaller, faster, and more interpretable models.