Somewhat related, in our recent preprint we showed how Bayesian updates work over quantum generators of stochastic processes. It's a different setup than the one you show here, but does give a generalization of Bayes to the quantum and even post-quantum setting. We also show that the quantum (and post-quantum) Bayesian belief states are what transformers and other neural nets learn to represent during pre-training. This happens because to predict the future of a sequence optimally given some history of that sequence, the best you can do is to perform Bayes on the hidden latent states of the (sometimes quantum) generator of the data.
Is there a reason they switched from divergence to fidelity when going quantum? You should want to get the classical Bayes' rule in the limit as your density matrices become classical, and fidelity definitely doesn't give you that.
Quoting from the paper:
Fidelity is one of the most natural measures of the closeness between quantum states and has found countless applications in quantum information theory.
I agree that this sort of quantum relative entropy should also be doable. It's possible that the result would be the same. I guess an easy check would be to perturb the posterior and check whether this measure also has a minimum around the same point.
Yeah, that was about the only sentence I read in the paper. I was wondering if you'd seen a theoretical justification (logos) rather than just an ethical appeal (ethos), but didn't want to comb through the maths myself. By the way, fidelity won't give the same posterior. I haven't worked through the maths whatsoever, but I'd still put >95% probability on this claim.
No, I don't think there's anything like that. I do wonder about deriving the same result for the divergence. I have no idea how hard that would be; it might even be quite easy. Possibly it even reduces to something more Bayes-like in the case of commuting operators. I'll try.
The title advertises a quantum version of Bayes' rule, but so far as I can tell the actual post never explicitly presents one. Am I missing something?
I think the title is fine. The post mostly reads, "if you want a quantum analogue, here's the path to take".
The actual formula is in the paper. I explained the process that it is obtained from. The formula for the posterior looks quite abstruse, would have required me to explain more notation, and ultimately doesn't give any particularly useful intuitions on its face, so I omitted it. You can also find it in my code.
This post is an attempt to summarise and explain for the LW readership the contents of this paper: "Quantum Bayes' rule and Petz transpose map from the minimum change principle". It's a highly technical paper heavy on quantum mechanics formalism that took me a couple of days to unpack and digest a bit, but I think it may be important going forward. My work on it is far from done, but this is a quick introduction.
Epistemic status: I have a Physics PhD and spent about ten years working with computational quantum mechanics so hopefully I know what I'm talking about, but if anyone can peer review I'll be glad for the help.
The tagline of Astral Codex Ten reads:
P(A|B) = [P(A)*P(B|A)]/P(B), all the rest is commentary.
This sentence could very well exemplify the ethos of the rationalist community as a whole[1], but looking at it from a physics perspective, it misses something. Bayes' theorem is a statement about information - it tells us how to update previous knowledge (a distribution of probabilities over potential world-states) using newly acquired information to refine it. Yet, the way it defines the knowledge is classical. There are states, and there are finite, real probabilities (that sum to 1) attached to them.
We know the world not to be classical. The world is quantum. Going into details about what this implies would make this post considerably longer, but for the informational angle that we care about here I direct you to Scott Aaronson's excellent lecture on the topic and will only include a very brief summary here:
This tells us that in fact having a complete quantum description of information is really important, and we may be missing some crucial elements if we don't. Nevertheless, until now, I had not really seen any satisfactory quantum equivalent of Bayes' theorem. Even the so-called QBism (Quantum Bayesianism) interpretation of quantum mechanics seemed to lack this element, and operated more at a qualitative level. While not everyone agrees and the philosophical debate still rages, it is at least reasonable (and often done) to consider the quantum state as a description of our knowledge about a system, rather than necessarily a true, physical thing. It is however strange that we don't really know how to update that knowledge freely the way we do with the classical kind.
This paper seems to remedy that. I'll go through the following steps to explain its contents:

- recap the classical Bayes' rule and show how it can be derived from a "minimum change" variational principle;
- explain how the paper generalises that principle to quantum states and quantum channels;
- demonstrate the resulting quantum Bayes' rule with some simple code.
As I said, this is still very much a WIP. Let me know if you spot any mistakes or want to contribute.
Suppose you have two systems, $X$ and $Y$, each with a set of possible states:

$$x \in \{x_1, \ldots, x_N\}, \qquad y \in \{y_1, \ldots, y_M\}$$
We start with a probability distribution on $X$, $P(x)$, and we also know a conditional probability distribution, or likelihood (which is essentially a matrix), $P(y|x)$. Now suppose we observe a certain state $\bar{y}$: then how should we update our knowledge of $X$? Bayes' rule says:

$$P(x|\bar{y}) = \frac{P(x)P(\bar{y}|x)}{P(\bar{y})}$$
Now, asks the paper, suppose that instead of a definite value we observe over a certain number of trials a distribution of outcomes, $Q(y)$. How are we to generalise this rule to update our knowledge about $X$? A natural extension is:

$$Q(x) = \sum_y Q(y)\frac{P(x)P(y|x)}{P(y)} = \sum_y Q(y)P(x|y)$$
The classic formulation is just the limit of this for when our $Q(y)$ is 1 in one state and 0 everywhere else. You might notice that this is a bit like a stochastic or Markov process, in which $Q(y)$ is the starting state and $P(x|y)$ the transition matrix. This is the view taken by the paper: we consider the likelihood and the posterior probability to both be akin to processes which operate on one distribution to produce another[2]. So their approach to recovering Bayes' theorem is the following:

- express the prior and likelihood as a joint distribution over $X$ and $Y$ (the forward process), and the observed distribution together with an unknown reverse transition matrix as a second joint distribution (the reverse process);
- minimise a measure of the change between the two joint distributions, subject to the reverse transition matrix being properly normalised;
- check that the minimum is attained exactly when the reverse transition matrix is the Bayesian posterior.
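As a quick sanity check of the extended update rule above, here it is in plain Python (the numbers are my own toy choice, picked to match the single-qubit example later in the post):

```python
import numpy as np

# Toy example: a binary X observed through a noisy channel Y.
P_x = np.array([0.5, 0.5])       # prior P(x)
P_yx = np.array([[0.8, 0.2],     # likelihood P(y|x): row y, column x
                 [0.2, 0.8]])
P_y = P_yx @ P_x                 # marginal P(y) = sum_x P(y|x) P(x)

Q_y = np.array([0.8, 0.2])       # observed distribution of outcomes

# Q(x) = sum_y Q(y) P(x) P(y|x) / P(y)
Q_x = P_x * (P_yx.T @ (Q_y / P_y))
print(Q_x)                       # [0.68 0.32]
```

The result, $(0.68, 0.32)$, matches what the quantum code finds below in the fully classical case.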
The full derivation follows below. Feel free to skip it if the math is too much; as long as you follow the logic above, that should be enough.
We define our joint distributions for the forward and reverse processes:

$$\gamma(x, y) = P(x)P(y|x), \qquad \tau(x, y) = Q(y)R(x|y)$$

where $R(x|y)$ is the reverse transition matrix[3].
Here, remember that since we're trying to recover Bayes' rule, $R(x|y)$ is our unknown quantity, which we're trying to retrieve through variational principles. We try to minimise the Kullback-Leibler divergence[4]:

$$D_{KL}(\tau\|\gamma) = \sum_{x,y}\tau(x,y)\log\frac{\tau(x,y)}{\gamma(x,y)}$$
subject to a normalisation constraint:

$$\sum_x R(x|y) = 1 \quad \forall y$$
We can unify this problem by defining an objective function that makes use of Lagrange multipliers:

$$\mathcal{L} = \sum_{x,y} Q(y)R(x|y)\log\frac{Q(y)R(x|y)}{P(x)P(y|x)} + \sum_y \lambda_y\left(\sum_x R(x|y) - 1\right)$$
To solve, we differentiate and solve for zero gradient:

$$\frac{\partial\mathcal{L}}{\partial R(x|y)} = Q(y)\left[\log\frac{Q(y)R(x|y)}{P(x)P(y|x)} + 1\right] + \lambda_y = 0$$

$$\frac{\partial\mathcal{L}}{\partial \lambda_y} = \sum_x R(x|y) - 1 = 0$$
Isolating $R(x|y)$ in the first equation:

$$R(x|y) = \frac{P(x)P(y|x)}{Q(y)}\,e^{-1-\lambda_y/Q(y)}$$
Substituting in the second:

$$\sum_x \frac{P(x)P(y|x)}{Q(y)}\,e^{-1-\lambda_y/Q(y)} = 1 \quad\Rightarrow\quad e^{-1-\lambda_y/Q(y)} = \frac{Q(y)}{\sum_x P(x)P(y|x)} = \frac{Q(y)}{P(y)}$$
Which then gives us back our Bayes' rule:

$$R(x|y) = \frac{P(x)P(y|x)}{P(y)} = P(x|y)$$
Q.E.D.
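The derivation also checks out numerically. Here is a quick constrained minimisation with scipy (my own sketch, not from the paper; toy numbers as before), confirming that the KL minimum lands exactly on the Bayes posterior:

```python
import numpy as np
from scipy.optimize import minimize

P_x = np.array([0.5, 0.5])
P_yx = np.array([[0.8, 0.2],        # P(y|x): row y, column x
                 [0.2, 0.8]])
Q_y = np.array([0.8, 0.2])
gamma = P_yx * P_x                  # gamma(y, x) = P(x) P(y|x)

def kl_divergence(r):
    R = r.reshape(2, 2)             # candidate R(x|y): row y, column x
    tau = Q_y[:, None] * R          # tau(y, x) = Q(y) R(x|y)
    return float(np.sum(tau * np.log(tau / gamma)))

# Normalisation constraint: sum_x R(x|y) = 1 for every y.
constraint = {"type": "eq",
              "fun": lambda r: r.reshape(2, 2).sum(axis=1) - 1.0}
result = minimize(kl_divergence, x0=np.full(4, 0.5),
                  bounds=[(1e-9, 1.0)] * 4, constraints=constraint)
print(result.x.reshape(2, 2))
# Each row y converges to P(x|y) = P(x) P(y|x) / P(y):
# [[0.8 0.2]
#  [0.2 0.8]]
```

Perturbing the optimum in any direction that keeps the rows normalised increases the divergence, which is exactly the "minimum change" property.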
Now comes the spicy part: how do we make this process into a quantum one? The process analogy is crucial, because we know how to apply transformations to quantum systems! The most general form of such a transformation is called a "quantum channel", and it allows you to express any kind of map from one state to another (including irreversible ones, which could for example simulate interaction with an outside environment; the important thing is that channels have to preserve the trace, so that real probabilities always keep summing to 1). This usually means some kind of time evolution, but the formalism doesn't require that. So we can establish a correspondence:

- probability distribution $\leftrightarrow$ density matrix;
- stochastic process (transition matrix) $\leftrightarrow$ quantum channel.
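For concreteness, here is what a simple quantum channel looks like in QuTiP, written in terms of Kraus operators. This is my own sketch of the probabilistic spin flip used in the tests below (the flip probability of 0.2 is my assumption, chosen to match the outputs):

```python
import numpy as np
import qutip as qt

p = 0.2                            # flip probability (assumed)
# Kraus operators: identity with prob 1-p, spin flip (sigma_x) with prob p.
kraus = [np.sqrt(1 - p) * qt.qeye(2), np.sqrt(p) * qt.sigmax()]

def apply_channel(rho, kraus_ops):
    # E(rho) = sum_k K_k rho K_k^dag. Trace-preserving because
    # sum_k K_k^dag K_k = identity, so probabilities keep summing to 1.
    return sum(K * rho * K.dag() for K in kraus_ops)

rho = qt.fock_dm(2, 0)             # qubit prepared "up"
print(apply_channel(rho, kraus))   # diag(0.8, 0.2): flipped with prob 0.2
```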
As long as we can express the joint probability distribution as a quantum density matrix, we can then apply some measure of distance between states (the one they use is called quantum fidelity, though their convention does not include the squaring that appears on Wikipedia). Given a prior density matrix, a quantum channel expressing the likelihood, and an observed end state, we can then maximise this fidelity (namely, minimise the distance) between the two joint probability distributions to find a "reversed" quantum channel that back-propagates the observed distribution into an updated knowledge of our system.
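Incidentally, QuTiP's built-in fidelity uses the same non-squared convention, $F(\rho, \sigma) = \mathrm{Tr}\sqrt{\sqrt{\rho}\,\sigma\sqrt{\rho}}$, which is handy for experimenting (toy states of my own below):

```python
import qutip as qt

rho = qt.fock_dm(2, 0)                      # pure "up" state
sigma = qt.Qobj([[0.8, 0.0], [0.0, 0.2]])
# For a pure rho this reduces to sqrt(<psi|sigma|psi>):
print(qt.fidelity(rho, sigma))              # ~0.894 = sqrt(0.8), not 0.8
```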
What happens in practice is that, given a quantum state in the Hilbert space $\mathcal{H}$, they "purify" it: namely, they sort of duplicate it, so that we get a bigger state (in $\mathcal{H}\otimes\mathcal{H}$) that contains two "copies" of it. Then we apply the quantum channel only to one of the two copies, which means we get a final state that is a joint description of both the starting point (the unaltered copy) and the end one (the copy to which the channel was applied).
We can of course do the same in reverse if we know the final state instead: duplicate it, apply the backwards channel to one of the copies (if we want the two joint states to be comparable, it has to be the other copy relative to the forward process), and get another joint quantum state out.
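Here is a minimal sketch of that purification step in QuTiP (my own implementation of the textbook construction $|\psi\rangle = (\sqrt{\rho}\otimes I)\sum_k |k\rangle|k\rangle$; the paper's construction may differ in its details). Note how the partial trace over one copy returns $\rho$ itself while the other returns its transpose, which is why the outputs below keep mentioning "transposed" states:

```python
import qutip as qt

def purify(rho):
    # |psi> = (sqrt(rho) (x) I) sum_k |k>|k>: a pure state on H (x) H
    # whose partial traces recover rho (copy 0) and rho^T (copy 1).
    d = rho.shape[0]
    kets = [qt.tensor(qt.basis(d, k), qt.basis(d, k)) for k in range(d)]
    omega = sum(kets[1:], kets[0])          # unnormalised entangled state
    return qt.tensor(rho.sqrtm(), qt.qeye(d)) * omega

rho = qt.Qobj([[0.5, -0.5j], [0.5j, 0.5]])  # the coherent prior used below
psi = purify(rho)
print(psi.ptrace(0))   # recovers rho
print(psi.ptrace(1))   # rho transposed: the off-diagonal signs flip
```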
Here is the code I wrote, using Python and the QuTiP library, for a quick test of this process on the simplest possible system (a single qubit subject to a probabilistic flip).
Here are a few outputs for very simple cases.
Uniform prior, decohered output
This is a purely classical case. We're starting with a uniform, decohered prior on the qubit, and after applying a spin flip with probability $p = 0.2$ we observe a fully classical state (probabilities $0.8$ and $0.2$ of up and down, respectively).
Starting gamma (prior):
[[0.5+0.j 0. +0.j]
[0. +0.j 0.5+0.j]]
Purified gamma, ptrace on A_2 (prior, not operated on):
[[0.5+0.j 0. +0.j]
[0. +0.j 0.5+0.j]]
Purified gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j 0. +0.j]
[0. +0.j 0.5+0.j]]
Processed gamma, ptrace on A_2 (posterior):
[[0.5+0.j 0. +0.j]
[0. +0.j 0.5+0.j]]
Processed gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j 0. +0.j]
[0. +0.j 0.5+0.j]]
Starting tau (observed distribution on Y):
[[0.8+0.j 0. +0.j]
[0. +0.j 0.2+0.j]]
Purified tau, ptrace on A_2 (observed distribution on Y):
[[0.8+0.j 0. +0.j]
[0. +0.j 0.2+0.j]]
Purified tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.8+0.j 0. +0.j]
[0. +0.j 0.2+0.j]]
Commutator [tau, E(gamma)] = 0.0
Processed tau, ptrace on A_2 (updated knowledge on X):
[[0.68+0.j 0. +0.j]
[0. +0.j 0.32+0.j]]
Processed tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.8+0.j 0. +0.j]
[0. +0.j 0.2+0.j]]
Fidelity: 0.9486833043041707
The result is as expected from the classical Bayes' theorem: the updated knowledge on $X$ is $0.68$ up, $0.32$ down.
Coherent prior, output set to the observed state
This is a case of setting a coherent prior (the qubit is in a perfect up + down superposition) and then setting the observation to exactly match the output, which should retrieve the original state.
Starting gamma (prior):
[[0.5+0.j 0. -0.5j]
[0. +0.5j 0.5+0.j ]]
Purified gamma, ptrace on A_2 (prior, not operated on):
[[0.5+0.j 0. -0.5j]
[0. +0.5j 0.5+0.j ]]
Purified gamma, ptrace on A_1 (tranposed prior):
[[0.5+0.j 0. +0.5j]
[0. -0.5j 0.5+0.j ]]
Processed gamma, ptrace on A_2 (posterior):
[[0.5+0.j 0. -0.4j]
[0. +0.4j 0.5+0.j ]]
Processed gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j 0. +0.5j]
[0. -0.5j 0.5+0.j ]]
Starting tau (observed distribution on Y):
[[0.5+0.j 0. -0.4j]
[0. +0.4j 0.5+0.j ]]
Purified tau, ptrace on A_2 (observed distribution on Y):
[[0.5+0.j 0. -0.4j]
[0. +0.4j 0.5+0.j ]]
Purified tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.5+0.j 0. +0.4j]
[0. -0.4j 0.5+0.j ]]
Commutator [tau, E(gamma)] = 0.0
Processed tau, ptrace on A_2 (updated knowledge on X):
[[0.5+0.j 0. -0.5j]
[0. +0.5j 0.5+0.j ]]
Processed tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.5+0.j 0. +0.4j]
[0. -0.4j 0.5+0.j ]]
Fidelity: 0.9000000050662574
We see that this definitely does happen: the guess about $X$ is correct. But the fidelity is not 1. This is not necessarily a contradiction. The matrices printed here are merely "partial traces", which discard the fact that, this being a quantum description, there are correlations between the two subsystems that are expressed only in the full density matrix. So it's the correlations, encoded in the off-diagonal terms of the full joint density matrix, that are not printed out meaningfully here but still contribute to the imperfect fidelity. I assume this is a bit like the classical situation where you start with a prior that your coin is fair and then observe 10 throws that fall exactly 5 heads and 5 tails: your best guess about the coin's nature doesn't change, but your overall belief distribution does. But there might be something more subtle I've missed.
I don't really have a good, impactful result to conclude this on. I wanted to put this post out quickly so that others can also look at this and potentially contribute. My impression, however, is that there are some really interesting things to attempt here. One obvious thing I might try next is quantum measurement: you can express that in terms of quantum channels, and "how do I do Bayesian inference back to the original state of a system given the classical outcomes of a quantum measurement" seems like an interesting question that might yield some insight into the way our knowledge interacts with quantum systems.
Die-hard Bayesianism and an above average appreciation for obscure kabbalistic culture references.
If you want to think about them in linear algebra terms, since we're working with finite numbers of states: the distributions are vectors, and the likelihood is a stochastic matrix acting on them, $P(y) = \sum_x P(y|x)P(x)$.
I assume the $R$ is supposed to stand for "reversed".
They suggest multiple ones work, but I focused on the Kullback-Leibler divergence.