A quantum equivalent to Bayes' rule

by dr_s
31st Aug 2025
10 min read
Comments
Adam Shai:

Somewhat related: in our recent preprint we showed how Bayesian updates work over quantum generators of stochastic processes. It's a different setup than the one you show here, but it does give a generalization of Bayes to the quantum and even post-quantum setting. We also show that the quantum (and post-quantum) Bayesian belief states are what transformers and other neural nets learn to represent during pre-training. This happens because, to predict the future of a sequence optimally given some history of that sequence, the best you can do is perform Bayesian updates on the hidden latent states of the (sometimes quantum) generator of the data.

dr_s:

Oh, that sounds interesting! Definitely gonna check this out.

James Camacho:

Is there a reason they switched from divergence to fidelity when going quantum? You should want to get the classical Bayes' rule in the limit as your density matrices become classical, and fidelity definitely doesn't give you that.

dr_s:

Quoting from the paper:

Fidelity is one of the most natural measures of the closeness between quantum states and has found countless applications in quantum information theory.

I agree that this sort of quantum relative entropy should also be doable. It's possible that the result would be the same. I guess an easy check would be to perturb the posterior and check whether this measure also has a minimum around the same point.

James Camacho:

Yeah, that was about the only sentence I read in the paper. I was wondering if you'd seen a theoretical justification (logos) rather than just an ethical appeal (ethos), but didn't want to comb through the maths myself. By the way, fidelity won't give the same posterior. I haven't worked through the maths whatsoever, but I'd still put >95% probability on this claim.

dr_s:

No, I don't think there's anything like that. I do wonder about deriving the same result for the divergence. I have no idea how hard that would be; it might even be quite easy. Possibly it even reduces to something more Bayes-like in the case of commuting operators. I'll try.

gjm:

The title advertises a quantum version of Bayes' rule, but so far as I can tell the actual post never explicitly presents one. Am I missing something?

James Camacho:

I think the title is fine. The post mostly reads, "if you want a quantum analogue, here's the path to take".

dr_s:

The actual formula is in the paper; I explained the process it is obtained from. The formula for the posterior looks quite abstruse, would have required me to explain more notation, and ultimately doesn't give any particularly useful intuitions on its face, so I omitted it. You can also find it in my code.

gjm:

Fair enough! I think the article would be improved by making this a bit more explicit somehow.


This post is an attempt to summarise and explain for the LW readership the contents of this paper: "Quantum Bayes' rule and Petz transpose map from the minimum change principle". It's a highly technical paper, heavy on quantum mechanical formalism, that took me a couple of days to unpack and digest, but I think it may be important going forward. My work on it is far from done, but this is a quick introduction.

Epistemic status: I have a Physics PhD and spent about ten years working with computational quantum mechanics so hopefully I know what I'm talking about, but if anyone can peer review I'll be glad for the help.

The tagline of Astral Codex Ten reads:

P(A|B) = [P(A)*P(B|A)]/P(B), all the rest is commentary.

This sentence could very well exemplify the ethos of the rationalist community as a whole[1], but looked at from a physics perspective, it misses something. Bayes' theorem is a statement about information - it tells us how to update previous knowledge (a distribution of probabilities over potential world-states) using newly acquired information to refine it. Yet the way it encodes that knowledge is classical: there are states, and there are finite, real probabilities (summing to 1) attached to them.

We know the world not to be classical. The world is quantum. Going into the details of what this implies would make this post considerably longer, so for the informational angle we care about here I direct you to Scott Aaronson's excellent lecture on the topic, and will only include a very brief summary:

  • a quantum description of a system does not assign real probabilities to each state, but complex amplitudes, whose squared magnitudes sum to 1;
  • if you multiply an entire system by a constant phase factor, nothing changes; differences in phase between states, however, matter a lot;
  • quantum evolution acts on both amplitudes and phases, and this allows for interference phenomena that enable a lot of the weirder quantum stuff, since amplitudes don't need to add up the way probabilities do and can even cancel each other out (this is why the infamous double slit experiment produces fringes);
  • the universe should by all means be in a "pure" state; however, in many situations it's convenient and possible to describe ensembles of physical subsystems as being in a mixed state, which represents a classical probability distribution over quantum states. This is mathematically represented with an object called a density matrix: its diagonal is a classical probability distribution over the states (and thus its trace is always 1), while its off-diagonal elements contain the phase information. For example, if we had a mole of atoms prepared in some quantum state with a certain distribution of uncertainty, this would be a good use case for a density matrix;
  • a density matrix whose off-diagonal elements are all zero is "decohered", and can be considered the classical limit of this. A decohered density matrix behaves exactly like a classical distribution, and follows classical Markovian dynamics (see the sketch below);
  • from an information theory viewpoint, coherent quantum states have very different properties. This is what quantum computers are all about: if we exploit the mathematical structure of quantum mechanics, the laws about, for example, which problems are solvable in polynomial time change, because interference phenomena allow for some new tricks that you couldn't otherwise pull off.
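
To make the density matrix picture concrete, here is a minimal sketch in Python using QuTiP (my own illustration, not taken from the paper or from the code linked below), contrasting a coherent qubit state with its decohered classical limit:

import numpy as np
from qutip import Qobj, basis, ket2dm

# Equal superposition (|0> + |1>)/sqrt(2): a coherent pure state
psi = (basis(2, 0) + basis(2, 1)).unit()
rho = ket2dm(psi)
print(rho.full())              # diagonal: probabilities; off-diagonal: phase information

# Zeroing the off-diagonal elements gives the decohered, classical limit
rho_dec = Qobj(np.diag(np.diag(rho.full())))
print(rho_dec.full())          # now just a classical 50/50 distribution
print(rho.tr(), rho_dec.tr())  # the trace stays 1 in both cases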

This tells us that having a complete quantum description of information really matters, and we may be missing crucial elements if we don't have one. Nevertheless, until now I had not seen any satisfactory quantum equivalent of Bayes' theorem. Even the so-called QBism (Quantum Bayesianism) interpretation of quantum mechanics seemed to lack this element, and operated more at a qualitative level. While not everyone agrees and the philosophical debate still rages, it is at least reasonable (and often done) to consider the quantum state a description of our knowledge about a system, rather than necessarily a true, physical thing. It is strange, then, that we don't really know how to update that knowledge freely the way we do with the classical kind.

This paper seems to remedy that. I'll go through the following steps to explain its contents:

  1. first, I'll re-derive the classical Bayes theorem using their formalism, which they build to create an analogy with the quantum case;
  2. then, I'll explain qualitatively how this process is translated to quantum formalism (you can read the paper for the hard stuff, but honestly I feel I wouldn't add much to it);
  3. finally, I'll link some code I wrote and show a few examples which I hope are correct.

As I said, this is still very much a WIP. Let me know if you spot any mistakes or want to contribute.

Classical derivation

Suppose you have two systems, X and Y, each with a set of possible states:

$$X = \{x_0, x_1, \ldots, x_n\} \qquad Y = \{y_0, y_1, \ldots, y_m\}$$

We start with a probability distribution on X, γ(x), and we also know a conditional probability distribution, or likelihood, φ(y|x) (which is essentially a matrix). Now suppose we observe a certain state y: how should we update our knowledge of x? Bayes' rule says:

$$\hat{\varphi}(x|y) = \frac{\varphi(y|x)\,\gamma(x)}{\sum_i \varphi(y|x_i)\,\gamma(x_i)}$$

Now, asks the paper, suppose that instead of a definite value y we observe over a certain number of trials a distribution of outcomes, τ(y). How are we to generalise this rule to update our knowledge about x? A natural extension is:

$$\gamma'(x) = \sum_j \hat{\varphi}(x|y_j)\,\tau(y_j)$$

The classic formulation is just the limit of this for when our τ is 1 in one state and 0 everywhere else. You might notice that this is a bit like a stochastic or Markov process, in which τ is the starting state and φ̂ the transition matrix. This is the view taken by the paper - we consider the likelihood and the posterior probability to both be akin to processes which operate on one distribution to produce another[2]. So their approach to recovering Bayes' theorem is the following:

  1. express the joint prior probability distribution of X and Y, called P_fwd, by applying our original likelihood to our prior; this expresses our initial expectation or knowledge of the combined system;
  2. express the joint probability distribution informed by our new knowledge, P_rev[3], by applying some hitherto unknown posterior distribution to our observed distribution on Y;
  3. minimise the distance[4] between the two distributions, which basically means "learn as much as you can from this new information and not one bit more", and under this "minimum change" principle find the correct formulation for the posterior distribution, which, lo and behold, will turn out to be the well-known Bayes' rule!

The full derivation follows below. Feel free to skip it if the math is too much; as long as you follow the logic above, that should be enough.

Minimum change derivation of Bayes' rule

We define our joint distributions:

$$P_{\mathrm{fwd}}(x,y) = \varphi(y|x)\,\gamma(x) \qquad P_{\mathrm{rev}}(x,y) = \hat{\varphi}(x|y)\,\tau(y)$$

Remember that since we're trying to recover Bayes' rule, φ̂(x|y) is our unknown quantity, to be retrieved through a variational principle. We try to minimise the Kullback-Leibler divergence:

$$D(P_{\mathrm{fwd}}, P_{\mathrm{rev}}) = \sum_{x,y} P_{\mathrm{fwd}}(x,y) \log\left(\frac{P_{\mathrm{fwd}}(x,y)}{P_{\mathrm{rev}}(x,y)}\right)$$

Subject to a normalisation constraint:

$$\sum_x P_{\mathrm{rev}}(x,y) = \tau(y) \implies \sum_x \hat{\varphi}(x|y) = 1$$

We can unify this problem by defining an objective function that makes use of Lagrange multipliers:

$$O(\hat{\varphi}, \lambda) = D(P_{\mathrm{fwd}}, \hat{\varphi}\,\tau) + \sum_y \lambda_y \left( \sum_x \hat{\varphi}(x|y) - 1 \right) \qquad \min_{\hat{\varphi},\,\lambda} O(\hat{\varphi}, \lambda)$$

To solve, we differentiate and solve for zero gradient:

$$\frac{\partial O}{\partial \hat{\varphi}(x|y)} = -\frac{\varphi(y|x)\,\gamma(x)}{\hat{\varphi}(x|y)} + \lambda_y = 0$$

$$\frac{\partial O}{\partial \lambda_y} = \sum_x \hat{\varphi}(x|y) - 1 = 0$$

Isolating φ̂(x|y) in the first equation:

$$\hat{\varphi}(x|y) = \frac{\varphi(y|x)\,\gamma(x)}{\lambda_y}$$

Substituting in the second:

$$\lambda_y = \sum_x \varphi(y|x)\,\gamma(x)$$

Which then gives us back our Bayes' rule:

$$\hat{\varphi}(x|y) = \frac{\varphi(y|x)\,\gamma(x)}{\sum_{x'} \varphi(y|x')\,\gamma(x')}$$

Q.E.D.
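
As a sanity check on the derivation, here is a small numerical sketch (all numbers made up for illustration; this is not code from the paper) confirming that the Bayes posterior is indeed the minimum of the divergence among normalised candidates:

import numpy as np

gamma = np.array([0.5, 0.3, 0.2])              # prior on X (three states)
phi = np.array([[0.9, 0.4, 0.2],               # likelihood phi[y, x] = p(y|x);
                [0.1, 0.6, 0.8]])              # each column sums to 1
tau = np.array([0.6, 0.4])                     # observed distribution on Y

P_fwd = phi * gamma                            # P_fwd[y, x] = phi(y|x) gamma(x)

def D(phi_hat):
    # KL divergence between P_fwd and P_rev[y, x] = phi_hat(x|y) tau(y)
    P_rev = (phi_hat * tau).T
    return np.sum(P_fwd * np.log(P_fwd / P_rev))

bayes = (phi * gamma).T                        # phi(y|x) gamma(x), indexed [x, y]
bayes /= bayes.sum(axis=0)                     # normalise over x: Bayes' rule

# Any normalised perturbation of the Bayes posterior increases the divergence
rng = np.random.default_rng(0)
for _ in range(5):
    eps = rng.normal(scale=1e-2, size=bayes.shape)
    eps -= eps.mean(axis=0)                    # keep each column summing to 1
    assert D(bayes) <= D(bayes + eps)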

Quantum derivation

Now comes the spicy part - how do we make this process into a quantum one? The process analogy is crucial: we know how to apply transformations to quantum systems! The most general form of such a transformation is called a "quantum channel", and it can express any kind of transformation from one state to another (including irreversible ones, which could for example simulate interaction with an outside environment; the important thing is that channels have to preserve the trace, so that probabilities always keep summing to 1). This usually means some kind of time evolution, but the formalism doesn't require that. So we can establish a correspondence:

  • a classical probability distribution becomes a quantum density matrix;
  • a classical process (like the likelihood, or the posterior) becomes a quantum channel.

As long as we can express the joint probability distribution as a quantum density matrix, we can apply some measure of distance between states (the one they use is called quantum fidelity, though their convention does not include the squaring that appears on Wikipedia). Given a prior density matrix, a quantum channel expressing the likelihood, and an observed end state, we can then maximise this fidelity (namely, minimise the distance) between the two joint probability distributions to find a "reversed" quantum channel that back-propagates the observed distribution into updated knowledge of our system.

What happens in practice is that, given a quantum state in the Hilbert space H_X, they "purify" it: namely, they sort of duplicate it, so that we get a bigger state that contains two "copies" of it. Then we apply the quantum channel to only one of the two copies, which means we get a final state that is a joint description of both the starting point (the unaltered copy) and the end one (the copy to which the channel was applied).

We can of course do the same in reverse if we know the final state instead: duplicate it, apply the backwards channel to one of the copies (if we want them to be comparable, it has to be the other one relative to the forward process), and get another joint quantum state out.
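
To make the purification step concrete, here is a minimal QuTiP sketch (again my own illustration, not code from the paper): purify a qubit density matrix from its eigenbasis and check that the two partial traces return the original state and its transpose, just like the "transposed prior" entries in the outputs below. Note that QuTiP's fidelity function uses the same square-root convention as the paper:

import numpy as np
from qutip import Qobj, tensor, ket2dm, fidelity

# A coherent qubit state (the same one used as prior in the second example below)
rho = Qobj(np.array([[0.5, -0.5j],
                     [0.5j, 0.5]]))

# Purification: |psi> = sum_i sqrt(p_i) |v_i> (x) |v_i*>, built from the eigenbasis
evals, evecs = rho.eigenstates()
terms = [np.sqrt(max(p, 0.0)) * tensor(v, v.conj())  # clip tiny negative eigenvalues
         for p, v in zip(evals, evecs)]
psi = terms[0]
for t in terms[1:]:
    psi += t
joint = ket2dm(psi)

print(joint.ptrace(0))                 # recovers rho itself
print(joint.ptrace(1))                 # recovers the transposed rho
print(fidelity(rho, joint.ptrace(0)))  # 1.0 (square-root convention)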

Crunching some numbers

Here is the code I wrote, using Python and the library QuTiP, for a quick test of this process on the simplest possible system (a single qubit subject to a probabilistic flip).
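
For reference, the probabilistic flip itself can be written as a quantum channel with two Kraus operators; here is a minimal sketch of that piece (the actual linked code may differ in its details):

import numpy as np
from qutip import Qobj, qeye, sigmax

p = 0.2                                 # flip probability
K = [np.sqrt(1 - p) * qeye(2),          # "nothing happens" branch
     np.sqrt(p) * sigmax()]             # "spin flip" branch

def flip_channel(rho):
    # E(rho) = sum_k K_k rho K_k^dag; trace-preserving since sum_k K_k^dag K_k = I
    out = K[0] * rho * K[0].dag()
    for k in K[1:]:
        out += k * rho * k.dag()
    return out

gamma = Qobj(np.diag([0.5, 0.5]))       # uniform decohered prior
print(flip_channel(gamma))              # still the maximally mixed state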

Here are a few outputs for very simple cases.

Uniform prior, decohered output

This is a purely classical case. We're starting with a uniform, decohered prior on the qubit, and after applying a spin flip with p=0.2 we observe a fully classical state τ=[0.8,0.2] (probabilities of up and down).

Starting gamma (prior):
[[0.5+0.j 0. +0.j]
 [0. +0.j 0.5+0.j]]

Purified gamma, ptrace on A_2 (prior, not operated on):
[[0.5+0.j 0. +0.j]
 [0. +0.j 0.5+0.j]]
Purified gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j 0. +0.j]
 [0. +0.j 0.5+0.j]]

Processed gamma, ptrace on A_2 (posterior):
[[0.5+0.j 0. +0.j]
 [0. +0.j 0.5+0.j]]
Processed gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j 0. +0.j]
 [0. +0.j 0.5+0.j]]

Starting tau (observed distribution on Y):
[[0.8+0.j 0. +0.j]
 [0. +0.j 0.2+0.j]]

Purified tau, ptrace on A_2 (observed distribution on Y):
[[0.8+0.j 0. +0.j]
 [0. +0.j 0.2+0.j]]
Purified tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.8+0.j 0. +0.j]
 [0. +0.j 0.2+0.j]]

Commutator [tau, E(gamma)] = 0.0
Processed tau, ptrace on A_2 (updated knowledge on X):
[[0.68+0.j 0.  +0.j]
 [0.  +0.j 0.32+0.j]]
Processed tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.8+0.j 0. +0.j]
 [0. +0.j 0.2+0.j]]

Fidelity: 0.9486833043041707

The result is as expected from the classical Bayes' theorem: the updated knowledge on X is γ′=[0.68,0.32].
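
For comparison, here is the same update computed classically (a sketch of the arithmetic, not part of the linked code):

import numpy as np

gamma = np.array([0.5, 0.5])            # uniform prior on the qubit
phi = np.array([[0.8, 0.2],             # flip channel as a stochastic matrix,
                [0.2, 0.8]])            # phi[y, x] = p(y|x) with p = 0.2
tau = np.array([0.8, 0.2])              # observed distribution on Y

post = (phi * gamma).T                  # Bayes: phi(y|x) gamma(x), indexed [x, y]
post /= post.sum(axis=0)                # normalise over x
print(post @ tau)                       # [0.68, 0.32], matching the output above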

Coherent prior, output set to the observed state

This is a case of setting a coherent prior (the qubit is in an equal superposition of up and down, with a relative phase) and then setting the observation to exactly match the output, which should retrieve the original state.

Starting gamma (prior):
[[0.5+0.j  0. -0.5j]
 [0. +0.5j 0.5+0.j ]]

Purified gamma, ptrace on A_2 (prior, not operated on):
[[0.5+0.j  0. -0.5j]
 [0. +0.5j 0.5+0.j ]]
Purified gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j  0. +0.5j]
 [0. -0.5j 0.5+0.j ]]

Processed gamma, ptrace on A_2 (posterior):
[[0.5+0.j  0. -0.4j]
 [0. +0.4j 0.5+0.j ]]
Processed gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j  0. +0.5j]
 [0. -0.5j 0.5+0.j ]]

Starting tau (observed distribution on Y):
[[0.5+0.j  0. -0.4j]
 [0. +0.4j 0.5+0.j ]]

Purified tau, ptrace on A_2 (observed distribution on Y):
[[0.5+0.j  0. -0.4j]
 [0. +0.4j 0.5+0.j ]]
Purified tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.5+0.j  0. +0.4j]
 [0. -0.4j 0.5+0.j ]]

Commutator [tau, E(gamma)] = 0.0
Processed tau, ptrace on A_2 (updated knowledge on X):
[[0.5+0.j  0. -0.5j]
 [0. +0.5j 0.5+0.j ]]
Processed tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.5+0.j  0. +0.4j]
 [0. -0.4j 0.5+0.j ]]

Fidelity: 0.9000000050662574

We see that this definitely does happen - the guess about X is correct. But the fidelity is not 1. This is not necessarily a contradiction: the matrices printed here are merely partial traces, which discard the fact that, this being a quantum description, there are correlations between the two subsystems that are expressed only in the full density matrix. It's those correlations, not printed out meaningfully here but still present in the off-diagonal terms of the full density matrix, that contribute to the imperfect fidelity. I assume this is a bit like the difference between starting with a prior that your coin is fair before observing any throws, versus observing 10 throws that fall exactly 5 heads and 5 tails: your estimate of the coin's nature doesn't change, but your belief distribution does. But there might be something more subtle I've missed.
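
As a toy illustration of this point (my own example, unrelated to the linked code): two joint states can have identical partial traces while differing in their correlations, and the fidelity between the joint states registers that difference.

from qutip import bell_state, ket2dm, qeye, tensor, fidelity

bell = ket2dm(bell_state('00'))         # maximally entangled two-qubit state
mixed = tensor(qeye(2), qeye(2)) / 4    # product of two maximally mixed qubits

print(bell.ptrace(0))                   # I/2...
print(mixed.ptrace(0))                  # ...identical partial traces
print(fidelity(bell, mixed))            # 0.5: the joint states still differ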

Conclusion

I don't really have a good, impactful result to conclude this on. I wanted to put this post out quickly so that others can also look at this and potentially contribute. My impression, however, is that there are some really interesting things to attempt here. One obvious thing I might try next is quantum measurement - you can express that in terms of quantum channels, and "how do I do Bayesian inference back to the original state of a system given the classical outcomes of a quantum measurement" seems like an interesting question, one that might yield insights on the way our knowledge interacts with quantum systems.

 

  1. ^

    Die-hard Bayesianism and an above average appreciation for obscure kabbalistic culture references.

  2. ^

    If you want to think about them in linear algebra terms, since we're working with finite numbers of states:

    • a probability distribution is going to be a vector;
    • the likelihood and the posterior distributions are n×m and m×n matrices respectively;
    • the joint probability distributions are also matrices; they come about by multiplying the columns of those by the elements of a probability distribution;
    • an output distribution is produced by using a dot product between a process (matrix) and a probability distribution on a single system (vector), resulting in a probability distribution on the other system (vector).
  3. ^

    I assume it's supposed to stand for "reversed".

  4. ^

    They suggest multiple ones work, but I focused on the Kullback-Leibler divergence.