I recently went through the paper Definability of Truth in Probabilistic Logic for a second time. Explaining things to others often helps me solidify my own knowledge, so I'm doing a walkthrough of the paper.

Within, I explain things that I found confusing and expand on sections where the paper is somewhat brief. This is designed to be something that I could send to a copy of myself that has not read the paper, along with the paper, to help my clone learn the concepts as fast as possible. Your mileage may vary: its quite likely that your sticking points differ from my own.

I'll assume competence in the subject area, including knowledge of Tarski's theorem and the impossibility of adding a truth predicate to a sufficiently expressive language.

Because this paper is an early draft, I've included a few suggestions for edits that would have helped me out. You can find them throughout, or collected at the bottom of the post.

Motivation

Section 1 of the paper is well put together. For posterity, I'll summarize it here:

  • Given a sufficiently expressive language of first-order logic, you cannot embed that language's truth predicate into the language itself. For if a language has access to its own truth predicate, it can express the liar's paradox: G ⇔ True('¬G').
  • Responses to this include:
    • Working with a tower of meta-languages, each containing the previous system's truth predicate
    • Allowing some sentences must take on a third "undefined" value, instead of true or false

This paper explores a third alternative: assign probabilities to sentences instead of binary (or trinary) truth-values. This is potentially far stronger than a tri-valued logic: it's all well and good to say that some sentences are "undefined", but it turns out that many sentences of interest — not just pathological paradoxes — become undefined. A probability function, by contrast, allows you to retain much more information about sentences which cannot be assigned values "true" or "false".

Preliminaries

Throughout the paper, we'll fix a language L that we're working with. It doesn't matter what language L is, so long as it's strong enough to perform a Gödel numbering, and that it has terms corresponding to the rational numbers.

We also consider a particular theory T (for example, ZFC) and assume that the rationals have their usual properties in T. Furthermore:

  • 'φ' (φ, in quotes) will denote the Gödel encoding of the sentence φ.
  • Q will denote the set of all rational numbers. I point this out because I lack a script font.

The paper is a little lax when throwing around variable names, especially in second section. Here's a quick run-down:

  • P refers to both a function and a symbol. I'll try to be explicit about which is which. In general, the function P operates on sentences, whereas the symbol P is contained in a sentence, and operates on quoted sentences. So P(φ) refers to the output of the function on the sentence φ, and P('φ') refers to the symbol P applied to a Gödel encoding of the sentence φ. This is usually not ambiguous.
  • μ is used for probability measures, and does not tend to be fixed. Remember that a probability measure is defined over a set of objects. It assigns a measure to each object such that the measure of the whole set is 1, while obeying the laws of probability.
  • T generally refers to the theory under consideration (eg ZFC), but there are times where it doesn't. I'll point them out.

Probability Predicates

We want a probability function defined on sentences of L. What does that actually mean?

We begin by considering all functions P that map sentences of L onto real numbers. Note that P could be any function from L onto reals. It could be that P takes "x" to 100 and "x or x" to -3. Don't let the suggestive name 'P' fool you: we haven't yet narrowed down the behavior of functions under consideration.

Clearly, such functions cannot in general be viewed as measuring the "probability" of a sentence. What, precisely, do we mean when we say we want a probability function for our logical language?

Intuitively, we want a P that treats sentences in a "consistent" manner. More formally, we want there to be some underlying probability measure μ over models such that P is "talking about" μ.

Note that we do not require P be a mere probability measure over sentences of L! That would be too weak a claim: such a P could assign 1 to the sentence x ∧ ¬x while being a perfectly good probability measure over sentences, though clearly such a P (which assigns 1 to a contradiction!) is not what we had in mind when we went hunting for a probability function.

Claiming that P obeys a probability measure μ over models of L is a much stronger claim. It states that there is some way μ of assigning probabilities to models such that P(φ) is the probability of the sentence φ according to how μ weights models. For example, such a P is forced to assign 0 to every contradiction, because there are no models that model a contradiction.

Let's make that explicit: We want P(φ) to be the proportion of all models (weighted by μ) that model φ. In other words, P(φ) = μ({M : M⊧φ}).

We call such a P "coherent". Coherent P are quite well-behaved: They map all contradictions to 0 (as no model models a contradiction) and all tautologies to 1 (as every model models a tautology). Coherent P also act like probability functions: we know that P(φ)=P(φ∧ψ)+P(φ∧¬ψ) for any φ and ψ, because clearly the models that model φ consist of the models that model both φ and ψ, and the models that model φ but don't model ψ.

These three properties are necessary for coherent P. Turns out, they're also sufficient for a coherent P. This may be rather surprising: remember that P is otherwise unrestricted. So the above claim is equivalent to saying that these three axioms are sufficient to make a coherent P:

  1. For all φ and ψ, P(φ)=P(φ∧ψ)+P(φ∧¬ψ)
  2. For all tautologies φ, P(φ)=1
  3. For all contradictions φ, P(φ)=0
The paper then seems to make the assumption that these axioms constrain the range of P to [0, 1] (or, equivalently, that these axioms are sufficient to guarantee that P maps logically equivalent sentences to the same value). This is not obvious to me, and may in fact be incorrect. (See discussion, below.)

But *assuming* that the range of P is [0, 1], the rest of the proof is fairly simple. For formal details, see the paper. I will only sketch the proof.

We're going to give a method for constructing theories. In this section, T refers to the constructed theory, NOT the T that we're holding fixed. Also, we're going to be assessing the probability of the theory under construction, using syntax like P(Ti). This is just the probability of the sentence constructed by conjuncting all sentences in the theory Ti. We can always assess this because each Ti will be finite.

Fix an enumeration of all sentences φi. Set T0 to be the empty set.

Note: The paper claims that "By axiom 3, P(T0)=1". I believe this is a typo: first of all, it should read "By axiom 2". Second of all, it's not obvious to me that P must assign a probability to the empty sentence, nor that it should be 1. This is, however, easily worked around by defining P(φ|Ti) to be P(φ) if i=0. For all other Ti, P(φ|Ti) is defined in the usual way.

Now iterate sentences, considering only sentences independent of the theory Ti built so far. For each independent sentence φj, set Tj=Ti∪{φj} with probability P(φj|Ti) and Tj=Ti∪{¬φj} with probability P(¬φj|Ti).

Define T=∪Ti, this is a complete consistent theory. At each step i along the way, the probability that the sequence derives φ is P(φ|Ti). Thus, the sequence of Ti is a martingale. Because T is complete and consistent, P(φ|Ti) stabilizes at either 0 or 1. Thus, the probability that any given T generated by this procedure derives φ is P(φ|T0), which is just P(φ).

Note: Briefly, a martingale is a sequence where the expectation of the next value is equal to the present observed value. I'm not entirely clear on why you also need the fact that P(φ|Ti) stabalizes at either 0 or 1 before concluding that T⊢φ with probability P(φ), due to inexperience with martingales. Regardless, the point is that given a probability function P over sentences, you can generate theories in such a way that the resulting theory models φ with probability P(φ), and this is not too surprising.

We have just given a process for constructing theories at random. This process defines a probability measure over complete consistent theories: every complete consistent theory has a particular probability of being generated by this process.

The chance that the T coming out of this process derives φ is P(φ). In other words, we have defined a μ such that μ({T : T⊢φ})=P(φ). By the completeness theorem, each complete consistent theory has a model, so this process also defines a measure μ over models such that P(φ)=μ({M:M⊧φ}). This is exactly what we wanted.

We have now showed that any P following the above axioms is coherent in that it acts as if there is an underlying probability measure over models of L such that P(φ) measures the probability of selecting a model (weighted by μ) that models φ.

Reflective Probability

Let's go back to considering our fixed theory T. Our goal is to allow T to "talk about" the "subjective probability" of its sentences: in other words, we want to extend the language L to include a coherent probability function P.

We can't include just any P, obviously. Specifically, we want to embed a probability function that talks about the probability of sentences of T: In other words, it must map all sentences derived by T to probability 1, and then assign probability values to sentences not derived by T. (Otherwise, it clearly isn't talking about T.) It's easy to see that probability functions that assign probability 1 to all sentences of T correspond to probability functions derived from some probability measure over models of T.

Pick a coherent probability function P that assigns probability 1 to all sentences of T. We now want to extend the language L to include a symbol P which is intended to represent the function P. Call the extended language L'.

It's not enough to just shove P into L and call it a day. We know that P assigns "sane" probabilities to sentences of L, but we have no idea how P reacts to sentences of L': the fact that P was sane when it was talking about L does not mean that P will remain sane when it can start referring to itself.

For example, imagine a probability function P such that P(φ)=1, but P(P('φ')=1)=0. This is legal, so long as P is coherent and P(φ)=1 is consistent with T. However, it's clearly not the probability function we're looking for, for it lies about itself: we would like the probability function to interpret the symbol P as the function P itself.

We could simply require this property, and consider only P where

∀φ∈L'. ∀a,b∈Q. a<P(φ)<b ⇔ P(a<P('φ')<b)=1

This translates to "whenever the function P says that the probability of φ is between rational numbers a and b, the function P also says that the sentence a < P('φ') < b has probability 1". In other words, this narrows our search down to functions P that treat the symbol P exactly as the function P itself acts.

Unfortunately, no such P exists. For imagine that there is such a P, and consider the sentence G ⇔ P('G')<1. Then

P(G)<1 ⇔ P(P('G')<1)=1 ⇔ P(G)=1

In other words, P(G)<1 ⇔ P(G)=1. Because P is coherent and has range [0, 1], this is a contradiction: no such P exists.

I'll call this the "strong reflective consistency" requirement, and as shown above it is unsatisfiable for exactly the same reason that we can't define a predicate True such that ∀φ∈L'. True(φ) ⇔ True(True('φ')).

So we must weaken our requirements. Fortunately, it turns out we don't have to weaken them very much. Instead of requiring that P can talk about itself exactly, we allow P to talk about itself with arbitrarily small (infinitesimal) error. In other words, we require

∀φ∈L'. ∀a,b∈Q. a<P(φ)<b ⇒ P(a<P('φ')<b)=1

If there is such a P, we're in business: While sentences of the language L' cannot precisely assert the probability of a sentence, they can do so with arbitrarily small error. Before we discuss whether such a P exists, let's make a few notes and see an example.

First, the above (weakened) "reflection principle" above implies the following:

∀φ∈L'. ∀a,b∈Q. ¬(a≤P('φ')≤b) ⇒ P(a≤P('φ')≤b)=0

Because if P('φ') is not in [a, b] then either P(P('φ')<a)=1 or P(P('φ')>b)=1. This can be rephrased as

∀φ∈L'. ∀a,b∈Q. P(a≤P('φ')≤b)>0 ⇒ a≤P('φ')≤b

which is the "other direction" of our weak reflection principle.

Second, an example: Consider the sentence G⇔P('G')<p for some p∈(0,1]. A reflectively consistent P must assign P(G)=p: If it assigns P(G)>p then P(P('G')>p)=1 so P(G)=0 (contradiction) and if it assigns P(G)<p then P(P('G')<p)=1 so P(G)=1 (contradiction). If we required the "strong" reflection principle, then P could not exist, because P(G)=p would imply P(P('G')<p)=0 and thus P(G)=0, a contradiction. But with the weak reflection principle, no such contradiction can follow, because P(G)=p does not force P(P('G')<p)=0: the symbol P is "uncertain" about the true value of the function P, and cannot prove that P('G') is precisely p.

Models of L' with a reflectively consistent P will be able to show that P(G)∈[p-ε,p+ε] for arbitrarily small ε, but cannot prove that P(G)=p. This is the mechanism by which consistency is maintained.

Finding P

The question is, are there (weakly) reflectively consistent P? We know that there are not strongly reflectively consistent P, just as there are not consistent truth predicates. Before we can get excited about reflectively consistent P, we have to prove that there are some.

I lack the knowledge to fully understand this proof, as it uses theorems from topology that I do not yet understand. I can, however, give you a sketch:

Consider the set A of coherent probability distributions over L' that assign probability 1 to T. View it as a subset of the space [0,1]^L'. A is convex, closed, and (by Tychonoff's theorem) compact. A is also non-empty, because there is at least one model of T: we can make a probability distribution that assigns 1 to M and 0 to any other model.

Given any P, we can construct the set of axioms that describe that P. For every φ, we add every sentence a<P('φ')<b for every rational interval (a, b) which does indeed contain P(φ). For each P, call this set Rp. We say that a probability function "obeys" Rp if it assigns probability 1 to every sentence in Rp. We need to show that there is some P with obeys its own reflexivity axioms.

We can define a function f : A → Powerset(A) such that f(P') = {P∈A : P(Rp')=1}. In other words, f takes probability functions P' to the set of probability functions that obey the reflectivity axioms of P'.

Each f(P) is convex and non-empty by the same arguments as above. We show that f has a closed graph and apply Kukatani's fixed point theorem to show that f has a fixed point. This guarantees the existence of some P such that P∈f(P), which means P obeys its own reflexivity axioms.

Thus, there exists a reflectively consistent P.

Note: I don't understand Kukatani's fixed point theroem. I need to brush up on my topology. In the meantime, the point is that we can construct a function from probability functions P onto the set of probability functions obeying the reflectivity axioms of P, and that this function has a fixed point.

Since f in general has a fixed point, we can say in general that reflectively consistent P do exist.

Knowing your limits

Section 3.3 is fairly clear and worth a read. For posterity, I'll summarize it below.

Note that a reflectively consistent P does not necessarily "know" it is reflectively consistent. Such a P is reflectively consistent, but it may not claim to be. In fact, if a reflectively consistent P assigns probability 1 to the general statement of its own reflective consistency, a contradiction can be derived.

We can weaken the reflection principle further, but it is not yet known whether there is a version that both captures the interesting behavior of reflection and which is both true of P and asserted in P.

I'll close with the closing quote of the paper:

[this] work shows that the obstructions presented by the liar's paradox can be overcome by tolerating an infinitesimal error, and that Tarski's result on the undefinability of truth is in some sense an artifact of the infinite precision demanded by reasoning about complete certainty.

Discussion

I don't have much to add at this time. I clearly need to read up on topology, and am doing so. The result is very interesting, and I'm looking forward to playing around with other reflection principles.

Next up is a walkthrough of Tiling Agents for Self-Modifying AI, and the Löbian Obstacle.

Compiled Notes

  1. p3. I, like John Baez here, originally thought that axiom 3 was unnecessary. With a little help from Daniel Dewey, I realized that P is not restricted to [0, 1], nor even restricted to map logically equivalent sentences to the same probability. Now I'm not convinced that axioms 1, 2, and 3 are sufficient to force P to act like a probability function. If they are, I think the paper should either show that any P following the three axioms has range [0,1] or that it maps logically equivalent sentences to the same value. (There's some discussion about this in the comments.)

  2. p3. It's weird to "fix a theory T" at the beginning of section 2.1, and then "Define T=∪Ti" on p3. It's not particularly confusing in context, but it irks the programmer in me.

  3. p3. I'm pretty sure that "By axiom 3, P(T0)=1" is a typo. Furthermore, it's not clear to me that we should treat the empty set like a tautological sentence. I would have appreciated a sentence after "Let T0=∅" to the effect of "P(Ti) is the value of P on the conjunction of all Ti, which we can do because each Ti is finite", with special treatment given to the case where Ti is empty.

  4. p4. I misparsed "which contradicts P(G)∈[0,1]" on the first pass. I took Range(P)=[0,1] as background knowledge and thought the claim should have been "so P(G)∈[0,1) and P(G)=1, a contradiction". I concluded there was a typo. This may have been a personal fluke. The meaning was clear, regardless.

New Comment
30 comments, sorted by Click to highlight new comments since:

I'm having a hard time showing that P maps equivalent statements to the same value, or that it's bounded. In particular, I'm having difficulty showing P(p && q) = P(q && p); it seems like none of the axioms lets you switch the order of two atoms. What's the trick I'm missing?

How about if axiom 1 read "P(x) = P(x && y) + P(¬y && x)"?

Then we could say P(q && p) = P(q && p && ¬p) + P(p && q && p) = P(p && q && p) by axioms 1' and 3, and similar for P(p && q).

So P(q && p) = P(p && q && p) = P(p && q).

I think your proof secretly turns ¬¬p into p in the first step, which isn't legit. However, your modification does give equality on logically equivalent sentences, in the following way:

Suppose p <=> ~q. Then P(p) = P(p && q) + P(~q && p) = P(~q && p) by 3 P(~q) = P(~q && p) + P(~p && ~q) = P(~q && p) by 3

Hence if p is equivalent to q and at least one starts with a ~ sign, then P is equal on them. Now suppose p <=> q with neither one starting with a ~ sign: we have

P(p) = P(p) = P(q) = P(q)

and we're done.

I think your proof secretly turns ¬¬p into p in the first step, which isn't legit.

It does, drat.

P(p) = P(~~p)

Ok. So to fix/complete my proof we need P(¬¬p && q && p) = P(p && q && p). Hm. Ok. So to prove they're equal we just add on another term using axiom 1 and then rule out the contradiction in such away that we show they're both equal to the same thing.

Or, once you have equality for logically equivalent sentences, note that (p && q) <=> (q && p) and hence we have directly that P of the two sides are equal.

Yes, the axioms seem incomplete, or perhaps it was simply meant to be implied that "P(p) = P(q & p) + P(¬q & p)" also. Otherwise as far as I can tell there's no axiom that lets you relate any expression containing "P(p" to an expression containing "P(q", unless the arguments of P(·) are each a tautology or contradiction (which is unhelpful).

Well, a tautology can be made up of non-tautological things; we could conceivably have some sentence phi(p, q) that's a tautology if p <=> q, such that P(p) = f(P(phi(p,q))) = f(P(phi(q,p))) = P(p). I think this is what ygert is trying to do. I don't have much hope for this approach, though.

I'd think that the way you'd prove it is with the fact that (p && q)<=>(q && p) is a tautology. I don't have an exact proof at the moment; let me work on it.

After working on the problem, I am convinced we also need an order-swapped version of axiom 1. If we had that, we could prove that any pair of equivalent statements have that same value: the general case of the problem benkuhn posed.

(If A and B are equivalent, then P(A&~B)=P(B&~A)=0 as contradictions, and so by axiom 1:

P(A)=P(A&B)+P(A&~B)=P(A&B)+0=P(A&B)

P(B)=P(B&A)+P(B&~A)=P(B&A)+0=P(B&A)

So close. If only we could swap the orders, we'd have P(A)=P(B).)

I tried applying the proof of the theorem to the problem, as P(q) = P(p) for equivalent statements p,q is clear from the claim P(q) = mu({M: M|= q}) so our desired result should be part of the proof of Theorem 1. However it seems that our desired result is implicitly assumed in the statement "Axiom 1 implies". Setting P(A|B) = P(B && A)/P(B) (note the reversal of order, without being allowed to replace equivalent statements inside P the mess is even greater if we pick the natural order. Also we want P(B) =/= 0) and writing psi for phi j we can write the claim as:

P(Ti && phi)/P(Ti) = P(Ti && psi && phi)/P(Ti && psi) P(Ti && psi)/P(Ti) + P(Ti && ~psi && phi)/P(Ti && ~psi) P(Ti && ~psi)/P(Ti)

(Here I ignored some annoying brackets, actually there's more of a mess as its not clear if P((A && B) && C) = P(A && (B && C))). Multiplying both sides by P(Ti) and simplifying now gives:

P(Ti && phi) = P(Ti && psi && phi) + P(Ti && ~psi && phi)

which does not follow from Axiom 1 as the added statements are in the middle of our statement. I am convinced that our claim (P(p) = P(q) for equivalent statements p,q) is required for the proof in the paper and should be added as an extra axiom.

On a sidenote:unde the assumption above (equivalent statements get equal probabilities) axiom 3 is a consequence of axioms 1 and 2. This can be seen as follows: Let q be a contradiction, so ~q is a tautology. By axioms 1 and 2 we have P(q) = P(q && q) + P(q && ~q). But ~q is a tautology, so p && ~q and p are equivalent for every p. In particular we have P(q && ~q) = P(q). But q and (q && q) are also equivalent, so the above can be written as P(q) = P(q) + P(q) so P(q) = 0.

I, too, am now doubtful that axioms 1-3 are sufficient. I've updated the post accordingly.

Yeah, I couldn't find anything either.

As Manfred and I showed above, replacing axiom 1 with "P(x) = P(x && y) + P(¬y && x)" gives a sufficient condition, though.

For if a language has access to its own truth predicate, it can express the liar's paradox: G ⇔ True('G').

You probably mean G ⇔ True('¬G')

Indeed I do. Fixed, thanks.

Nitpick, the link in the first sentence reads "Definability of Truth in Probabilistic Locic" rather than logic.

Fixed, thanks.

You have mentioned the weakened reflection principle as being the following: ∀φ∈L'. ∀a,b∈Q. a≤P(φ)≤b ⇒ P(a<P('φ')<b)=1

This seems to be a typo, it should be ∀φ∈L'. ∀a,b∈Q. a<P(φ)<b ⇒ P(a<P('φ')<b)=1

Right you are. Fixed, thanks.

I'm confused by a couple minor points here, also:

  1. The paper asks for a "probability distribution over models of L". In fact, for many languages L, models of L form a proper class. Does this cause measure-theoretic difficulties? It seems like this might force mu to be zero on all sufficiently large models (otherwise you can do some sort of transfinite induction to get sets of unbounded measure) but I'm not very good at crazy set theory stuff.

  2. At one point the authors state "We would like P(forall phi in L' )". I thought we were in a first-order language and therefore couldn't quantify over propositions?

  3. It's not immediately clear to me that this actually constructs a measure on the set of theories: that is, if S is the set of all complete consistent theories, it's not clear to me that for the mu we construct by martingale, mu(S) = 1 (or even that mu(S) != 0). Mightn't additivity break when we take the limit and get a whole theory rather than just an incomplete bag of axioms?

  1. Can we instead do "probability distribution over equivalence classes of models of L", where equivalence is determined by agreement on the truth-values of all first order sentences? There's only 2^ℵ₀ of those, and the paper never depends on any distinction within such an equivalence class.

Yes, though we should just call it a "probability distribution over complete consistent theories" in that case (it's exactly the same).

  1. There are definitely some probability distributions over proper classes that are useful (for example, a measure that assigns .5 to one model, .5 to another, and zero to the rest). No model would ever be forced to have measure 0, as you can always construct the measure that assigns 1 to that particular model and 0 to all the others. But as to whether or not there are other difficulties with defining a probability measure over a proper class, I have no idea. I, too, lack skill with crazy set theory.

  2. You're referring to page 7? I believe that it means to say "we would like a P that obeys the axiom schema P(forall a, b in Q ... phi ...) for all phi in L". You're right, though, this is somewhat ambiguous.

  3. I don't completely understand your question. Are you questioning whether T=UTi is actually complete and consistent? Compactness guarantees that it is consistent, and the enumeration of sentences guarantees it is complete.

  1. I meant that (conjecturally) for every measure, there exists a cardinal kappa such that mu({M: |M| > kappa}) = 0. Anyway, I guess as you've demonstrated the set/class thing isn't a big problem, but it is something to watch out for.

  2. Okay, that makes sense.

  3. No, I was observing the following: mu is countably additive, and the set of theories is countable. Hence the measure of the total space is the sum of the measures of the theories, so the measures of the theories must sum to 1. Now it's clear that at every step i of the process, the sum of the measures of the (incomplete) theories so obtained is 1. But it's not immediately clear to me that this holds in the limit.

    However, I just realized my mistake, which is that the set of theories isn't always countable (there are countably many sentences, but a theory is a subset of the sentences; for instance, consider the language with countably many unary relations and a constant symbol). In particular, I believe it's countable if and only if the sum is preserved in the limit, so we're fine.

For a countable language L and theory T (say, with no finite models), I believe the standard interpretation of "space of all models" is "space of all models with the natural numbers as the underlying set". The latter is a set with cardinality continuum (it clearly can't be larger, but it also can't be smaller, as any non-identity permutation of the naturals gives a non-identity isomorphism between different models).

Moreover this space of models has a natural topology, with basic open sets {M: M models phi} for L-sentences phi, so it makes sense to talk about (Borel) probability measures on this space, and the measures of such sets. (I believe this topology is Polish, actually making the space Borel isomorphic to the real numbers.)

Note that by Lowenheim-Skolem, any model of T admits a countable elementary substructure, so to the extent that we only care about models up to some reasonable equivalence, countable models (hence ones isomorphic to models on the naturals) are enough to capture the relevant behavior. (In particular, as pengvado points out, the Christiano et al paper only really cares about the complete theories realized by models, so models on the naturals suffice.)

Great post! For anyone reading this who isn't familiar with model theory, by the way, the bit about

sentence G ⇔ P('G')<1. Then

may not be obvious. That is, we want a sentence G which is true iff P('G') < 1 is true. The fact that you can do this is a consequence of the diagonal lemma, which says that for any reasonable predicate 'f' in a sufficiently powerful language, you can find a sentence G such that G is true iff f(G) is true. Hence, defining f(x) := P('x') < 1, the lemma gives us the existence of G such that G holds iff f(G) holds, ie, iff P('G') < 1 as desired.

Mostly I bring this up because the diagonal lemma is among the most interesting results in early model theory. It has a simple statement and is how self-reference is constructed, which is what gives us the incompleteness theorems. If anyone is interested in getting in to model theory, looking up the proof and background for the proof would be a great place to start.

The empty set is tautological because P({}) = P({} and something) + P({} and not-something) = P(something) + P(not-something). Hm, but that's using axioms 1 and 2. Can we get it using just axiom 3 as the paper claims?

When you say that P({} and something) = P(something), you suppose the hypothesis (in addition to using several nontrivial consequences of coherence that So8res mentioned, like P mapping equivalent statements to the same thing).

More importantly, "{} and something" isn't a syntactically correct sentence. I don't think most authors consider the empty sentence syntactically correct either. (Marker, the textbook I used, doesn't.)

Whoops, you're right.

This should be finitely additive probability measures, right? Just saying "probability measure" usually means countably additive.

I'll defer to the paper here, which states

De nition 1 (Coherence). We say that P is coherent if there is a probability measure over models of L such that P(phi)=mu({M:M models phi})

That said, I'm not aware of a reason why we should require P be backed by a finitely additive probability measure.