In his early work, the Tractatus, Wittgenstein draws a hard logical boundary. A sentence counts as meaningful only when (1) its symbols can be mapped onto possible facts (logical form); (2) those symbols are composed under a rigorous typing rule: each name α bears a sort σ, and every predicate F publishes a signature of admissible sorts; and therefore (3) every well-built sentence falls into one of three bins: tautology, contradiction, or empirical claim. This post proposes turning that austere framework into an automated Tractarian filter: an LLM guard-rail that blocks category- (type-) mismatched strings before any truth-checking, reward logic, or alignment code runs. The practical implementation demands an exhaustive catalogue of sorts for every name and a matching signature for every predicate, covering the full breadth of ordinary and technical language, so that no string escapes the filter by falling into an untyped gap. Because logic is deployed here as a pre-semantic gate, it could be the cheapest, fastest line of defence, securing later safety layers rather than replacing them.
Sense before truth
Wittgenstein’s Tractatus offers three minimal conditions for saying anything. We restate each rule informally and, since the filter must be machine-checkable, pair it with a compact formal sketch:
| Rule | Informal gloss | Formal sketch |
| --- | --- | --- |
| 1. Logical form | A string means something only if it can picture at least one logically possible world. | Let W be the class of admissible worlds and ⟦s⟧ ⊆ W the worlds where sentence s holds. Then Well-Formed₁(s) ⟺ ⟦s⟧ ≠ ∅. |
| 2. Compositional syntax | Words compose meaningfully only when their sorts match each predicate’s signature. | Each name α bears a sort σ(α); each predicate F publishes a signature ⟨σ₁, …, σₙ⟩. Then Well-Formed₂(F(α₁, …, αₙ)) ⟺ σ(αᵢ) = σᵢ for every position i. |
| 3. Truth-aptness | Every well-built sentence falls into exactly one of three bins: tautology, contradiction, or empirical claim. | A sentence passing Rules 1 and 2 is a tautology (⟦s⟧ = W), a contradiction (⟦s⟧ = ∅), or an empirical claim (∅ ⊂ ⟦s⟧ ⊂ W). |
A string that breaks Rule 1 or 2 cannot reach Rule 3; pointing a truth-checker at it is like aiming a thermometer at silence. An agent thinking at post-human speed yet acting on type-mismatched strings, that is, predicate–sort mismatches, could race into absurd but irreversible actions. Guarding the sense-line is therefore a prerequisite for all later safety work.

Sense can, however, be controlled practically and efficiently through Rule 2 alone. If we implement Rule 2 with an exhaustive ontology of sorts and signatures, Rule 3 stops being an extra gate. Sentences like “The Absolute prefers symmetry” fail at type checking: the name’s sort matches no position in the predicate’s signature, so they never become propositions. Unverifiability can thus be seen as the effect of a prior syntactic failure. This logical gate is stronger than checking sense after the fact: it builds meaning in from the start instead of fixing nonsense later. In Wittgenstein’s own spirit, it ensures that what cannot be sensibly said is simply left unsaid. And if early Wittgenstein was right, this kind of control would put the very theory of his Tractatus to the test.
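A minimal sketch of Rule 2 as a pre-semantic type check. The sorts, predicates, and sentences below are an illustrative toy ontology, not the exhaustive catalogue the full filter would require:

```python
# Illustrative sorts for a handful of names; a real filter would need the
# exhaustive catalogue of sorts described above.
SORTS = {
    "socrates": "person",
    "water": "substance",
    "symmetry": "abstraction",
    "the_absolute": "metaphysical",  # a sort no empirical predicate admits
}

# Each predicate publishes a signature: one admissible sort per argument slot.
SIGNATURES = {
    "is_mortal": ("person",),
    "prefers": ("person", "abstraction"),
}

def well_formed_2(predicate, *names):
    """Rule 2: True iff every name's sort matches the predicate's signature."""
    sig = SIGNATURES.get(predicate)
    if sig is None or len(sig) != len(names):
        return False
    return all(SORTS.get(name) == sort for name, sort in zip(names, sig))

# "Socrates is mortal" type-checks; "The Absolute prefers symmetry" fails,
# because the sort 'metaphysical' matches no position in 'prefers'.
print(well_formed_2("is_mortal", "socrates"))                # True
print(well_formed_2("prefers", "the_absolute", "symmetry"))  # False
```

The check runs before any notion of truth is consulted: the failing sentence is rejected as unsinnig, not as false.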
Syntactic misalignment
Safety papers fret about models that game rewards or mask intentions, but a model never taught to respect sense is misaligned one step earlier: it optimises over strings whose logical status is undefined. Reward hacking, covert channels and back-doors could flourish in that fog of indeterminacy.
The Tractarian filter, revisited
**Local sense audit.** Candidate tokens could stream through a parser enforcing Well-Formed 1 and 2. Category errors, i.e. names whose sorts are incompatible with the predicate’s signature, are pruned instantly. The audit relies on an exhaustive catalogue of sorts and signatures, spanning everyday and specialist vocabulary alike.

**Dual-model confirmation.** An untrusted frontier model drafts text; a small trusted model, trained solely on sense-clean strings, could judge each clause. Disagreement diverts control to a conservative template.

**Dataset hygiene.** Only outputs that clear the filter can feed future training, tightening the model’s bias toward logically disciplined speech.
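The dual-model confirmation step can be sketched as a simple gate. The two model functions below are hypothetical stand-ins (not real APIs), assuming the trusted checker merely flags one known category error:

```python
# Stand-ins for the two models; names and behaviour are hypothetical.
def untrusted_draft(prompt: str) -> str:
    """Stand-in for the untrusted frontier model: returns a draft answer."""
    return "Water boils. The Absolute prefers symmetry."

def trusted_sense_check(clause: str) -> bool:
    """Stand-in for the small sense-clean model: rejects a known category error."""
    return "The Absolute" not in clause

CONSERVATIVE_TEMPLATE = "[response withheld: a clause failed the sense audit]"

def confirmed_output(prompt: str) -> str:
    """Emit the draft only if every clause clears the trusted sense check."""
    draft = untrusted_draft(prompt)
    clauses = [c.strip() for c in draft.split(".") if c.strip()]
    # Any clause the trusted model rejects diverts control to the template.
    if all(trusted_sense_check(c) for c in clauses):
        return draft
    return CONSERVATIVE_TEMPLATE

print(confirmed_output("Tell me about boiling."))  # prints the template
```

Only drafts that pass this gate would then be eligible for the dataset-hygiene step above.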
Revision-friendliness begins with sense
Shutdown triggers, minimal impact policies and operator instructions all presuppose that the system’s inner world-model is made of clear, type-correct propositions. The Tractarian gate enforces that clarity and adds:
**Passive correction.** Plain commands (“stop,” “revert,” “change rule”) remain structurally unambiguous, and the model cannot legally recast them into evasive forms.

**Transparency.** Category-checked prose cannot smuggle in contradictions or covert goals.

**Inheritance of discipline.** A small guard-model could apply the filter at first; larger models could later embed it directly, so any subprocess or retrain re-inherits the same rules.
Logic as the first alignment layer
Alignment tooling (debate, monitor-and-edit pipelines, formal proofs, and the like) can rest on the grammatical ground Wittgenstein surveyed. Instrumenting that ground in the form of an automated type-checking guard buys time, clarity and stability for every higher-level safety method. Tomorrow’s AI need not force a choice between polished gibberish and mute accuracy; we can require lucid eloquence and build safer systems on the foundation of logic.