Somewhat related, in our recent preprint we showed how Bayesian updates work over quantum generators of stochastic processes. It's a different setup than the one you show here, but it does give a generalization of Bayes to the quantum and even post-quantum setting. We also show that the quantum (and post-quantum) Bayesian belief states are what transformers and other neural nets learn to represent during pre-training. This happens because to predict the future of a sequence optimally given some history of that sequence, the best you can do is to perform Bayes on the hidden latent states of the (sometimes quantum) generator of the data.
Read - thanks! Not digested in detail yet but enough to get the general outline. I didn't remember your name so was pleasantly surprised to realise this was a "sequel" to Transformers represent belief state geometry in their residual streams.
Focusing on the Bayesian part, if I understand correctly you do this on measurements, so it's kind of semiclassical? Basically you measure one quantity, get an outcome, then normalise and use that as your token, plus evolve the system with that operator. This would have to be an ensemble operation on prepared qubits in a real experiment, since the measured quantities aren't orthogonal. I guess I could try to check whether we recover your update formula for the case of discrete measurements, but I'm not sure how to go about it off the top of my head.
The "Classical derivation" made more sense to me after translating it to standard probability notation, so I'm commenting to share the "dictionary" I made for it, as well as the unexpected extra assumption I had to make.
The obvious: the prior becomes P(x), the likelihood becomes P(y|x), and an observed state becomes y.
It got tricky with Q(y). Instead of observing Y, we observe something else that gives us a probability distribution over Y. I considered this "something else" to be the value of some other unknown: Z. The probability distribution over Y is a conditional distribution: Q(y) = P(y|z).
Hate to have z on only one side of that equation like that... maybe I should have called it Q(y|z)... but I'll leave it as is.
Then, P(x|z) = Σ_y P(x|y,z) P(y|z).
Not quite the right formula for a simple interpretation of the update... if only P(x|y,z) = P(x|y).
This is conditional independence, which could be represented with this Bayes net: Z → Y → X.
Then, we have P(x|z) = Σ_y P(x|y) P(y|z) = Σ_y P(x|y) Q(y).
That completes the dictionary.
So to do what feels like ordinary probability theory, I had to introduce this extra unknown Z so that we have something to observe, and then to assume that Z only provides information about Y (and indirectly about X, through Y).
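To make this concrete, here is a small numerical sketch (toy numbers mine): calling the extra unknown Z, with the chain Z → Y → X, conditioning on z is just a Q(y)-weighted average of ordinary posteriors.

```python
import numpy as np

# Toy numbers (mine). Chain Z -> Y -> X: conditioning on z under
# P(x|y,z) = P(x|y) is a Q(y)-weighted average of ordinary posteriors.
P_x = np.array([0.5, 0.5])                 # prior P(x)
P_yx = np.array([[0.9, 0.1],               # P(y|x): row x, column y
                 [0.3, 0.7]])

P_y = P_x @ P_yx                           # marginal P(y)
P_x_given_y = (P_yx * P_x[:, None]) / P_y  # ordinary Bayes, P(x|y)

Q_y = np.array([0.8, 0.2])                 # Q(y) = P(y|z), the soft evidence
P_x_given_z = P_x_given_y @ Q_y            # sum_y P(x|y) P(y|z)
print(P_x_given_z, P_x_given_z.sum())      # a proper distribution over X
```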
The way you described Q(y) - as some probability distribution resulting from an observation, but not a conditional distribution - is in philosophy called Jeffrey conditionalization. The Stanford Encyclopedia of Philosophy gives this example:
A gambler is very confident that a certain racehorse, called Mudrunner, performs exceptionally well on muddy courses. A look at the extremely cloudy sky has an immediate effect on this gambler’s opinion: an increase in her credence in the proposition (muddy) that the course will be muddy—an increase without reaching certainty. Then this gambler raises her credence in the hypothesis (win) that Mudrunner will win the race, but nothing becomes fully certain. (Jeffrey 1965 [1983: sec. 11.3])
The idea is, we go from one probability distribution over Y to another, without becoming certain of anything. My introduction of Z corresponds to introducing an unknown representing the status of the sky. I would say we are conditioning on Z.
I recalled vaguely that Jaynes discussed Jeffrey conditionalization in Probability Theory, and criticized it for holding only in a special case. I took a look, and sure enough, it's in section 5.6, and he's pointing out exactly what I did, right down to the arrows, though he calls it a "logic flow diagram" rather than identifying it as a Pearl-style Bayes net.
I don't think you necessarily need a Z though. My interpretation of that step was "suppose we know X is a constant but hidden reality, and Y is observable. Then we perform experiments and measure the resulting Y, and thus characterise a probability distribution of it. How does that inform our guess on X?". And yeah, that could be mediated by a third variable, but it doesn't need to be. If X is "the coin is fair, or the coin is loaded to land 75% of the time on heads" and Y is "the result of a coin toss", you get a better (lower entropy) belief distribution on X after several tosses.
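That coin setup is easy to check numerically. A minimal sketch (tosses made up by me): updating the two-hypothesis belief toss by toss, the posterior entropy drops below the prior's one bit.

```python
import numpy as np

# X: "fair" vs "loaded (75% heads)"; Y: individual coin tosses (made up).
p_heads = np.array([0.5, 0.75])        # P(heads | x)
prior = np.array([0.5, 0.5])

def entropy(dist):
    return -np.sum(dist * np.log2(dist))

tosses = [1, 1, 0, 1, 1, 1, 0, 1]      # 1 = heads, 0 = tails

post = prior.copy()
for y in tosses:
    likelihood = p_heads if y == 1 else 1 - p_heads
    post = post * likelihood           # Bayes update, one toss at a time
    post = post / post.sum()

print(post)                            # the loaded coin is now favoured
print(entropy(prior), entropy(post))   # entropy dropped below 1 bit
```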
Thanks for this by the way, I used the paper's notation but agree it was a bit confusing so this probably helps people!
Yeah... well, I thought of the Z because it sounds like we're getting the probabilities of Y from some experiment. So z is the result of the experiment, which in this case is a vector of frequencies. When I put it like that, it sounds like Z is just a rhetorical device for saying that we have given probabilities of Y.
But I still seem to need Z for my dictionary. I have P(x|z). What is P(x|z)? It is some kind of updated probability of X, right? Like we went from one probability to the other by doing an experiment. If I didn't write the z, I'd need something like P_before(x) and P_after(x).
Reading again, it seems like this is exactly Jeffrey conditionalization. So whether you include some extra variable just depends on what you think of Jeffrey conditionalization.
I feel like I'm missing something, though, about what this experiment is and means. For example, I'm not totally clear on whether we have one state X, and a collection of replicates of state Y; or is it a collection of replicates of (X, Y) pairs?
Looking at the paper, I see the connection to Jeffrey conditionalization is made explicitly. And it mentions Pearl's "virtual evidence method"; is this what he calls introducing this Z? But there's no clarity on exactly what this experiment is. It just says:
But how should the above be generalized to the situation where the new information does not come in the form of a definite value for Y, but as “soft evidence,” i.e., a probability distribution Q(y)?
By the way, regarding your coin toss example, I can at least say how this is handled in Bayesian statistics. There are separate random variables for each coin toss: Y_1 is the first, Y_2 is the second, etc. If you have n coin tosses, then your sample is a vector containing Y_1 to Y_n. Then the posterior probability is P(x | y_1, ..., y_n). This will be covered in any Bayesian statistics textbook as "the Bernoulli model". My class used Hoff's book, which provides a quick start.
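A minimal sketch of that Bernoulli model with a conjugate Beta prior (toy tosses; the choice of Beta(1, 1), i.e., the uniform prior, is mine):

```python
import numpy as np

# Toy data: 1 = heads. With a Beta(a0, b0) prior on the heads
# probability theta, the posterior after the tosses y_1..y_n is
# Beta(a0 + #heads, b0 + #tails): the conjugate update.
y = np.array([1, 1, 0, 1, 0, 1, 1, 1])

a0, b0 = 1.0, 1.0                      # uniform prior
a_post = a0 + y.sum()
b_post = b0 + len(y) - y.sum()
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, posterior_mean)  # 7.0 3.0 0.7
```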
I guess this example suggests a single unknown (whether the coin is loaded or not) and replicates of .
Yes, I'm aware of the Bernoulli model - my point is that the vector of outcomes is itself the outcome of that experiment; I suppose you can call it z, though that makes the notation a bit confusing. The general point is that yes, you update your belief about X based on a series of outcomes of Y. In fact I think in general P(x | y_1, ..., y_n) works just fine.
Is there a reason they switched from divergence to fidelity when going quantum? You should want to get the classical Bayes' rule in the limit as your density matrices become classical, and fidelity definitely doesn't give you that.
Quoting from the paper:
Fidelity is one of the most natural measures of the closeness between quantum states and has found countless applications in quantum information theory.
I agree that this sort of quantum relative entropy should also be doable. It's possible that the result would be the same. I guess an easy check would be to perturb the posterior and check whether this measure also has a minimum around the same point.
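For the classical case that perturbation check is easy to run. Assuming the classical counterpart of their (non-squared) fidelity is the Bhattacharyya overlap Σ√(γτ) between the two joint distributions, perturbing the posterior away from Bayes only ever lowers it (toy numbers chosen to match the first qubit test):

```python
import numpy as np

# Toy numbers matching the post's first qubit test. The classical analogue
# of the (non-squared) fidelity between the two joints is the
# Bhattacharyya overlap sum_{x,y} sqrt(gamma(x,y) tau(x,y)).
P_x = np.array([0.5, 0.5])
P_yx = np.array([[0.8, 0.2], [0.2, 0.8]])        # P(y|x)
Q_y = np.array([0.8, 0.2])                       # observed distribution

P_y = P_x @ P_yx
gamma = P_x[:, None] * P_yx                      # gamma(x, y) = P(x) P(y|x)
bayes = (P_yx * P_x[:, None]).T / P_y[:, None]   # R(x|y), rows indexed by y

def fidelity(R):
    tau = Q_y[:, None] * R                       # tau(y, x) = Q(y) R(x|y)
    return np.sum(np.sqrt(gamma.T * tau))

best = fidelity(bayes)
rng = np.random.default_rng(0)
for _ in range(1000):
    R = np.clip(bayes + rng.normal(scale=0.05, size=bayes.shape), 1e-12, None)
    R /= R.sum(axis=1, keepdims=True)            # keep each row normalised
    assert fidelity(R) <= best + 1e-12           # Bayes is never beaten
print(best)   # ~0.94868
```

Note that the maximum value here agrees, up to numerical error, with the fidelity printed by the first qubit test in the post, which is the same classical situation.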
Yeah, that was about the only sentence I read in the paper. I was wondering if you'd seen a theoretical justification (logos) rather than just an ethical appeal (ethos), but didn't want to comb through the maths myself. By the way, fidelity won't give the same posterior. I haven't worked through the maths whatsoever, but I'd still put >95% probability on this claim.
So, to add to this:
they may have chosen this route because it turns out that taking the derivative of a matrix logarithm, without certain guarantees of commutativity of the matrix with its own differential, is really, really hard. Which, to be fair, isn't a good reason per se, but yeah.
Also, the paper mentions that
the Kullback–Leibler divergence [7, 10], other f-divergences including Pearson divergence and Hellinger distance [34], zero-one loss [35], or the mean-square error of an estimation [36, 37]
and looking at it, the quantum fidelity reduces to one minus the Hellinger distance squared:
https://en.wikipedia.org/wiki/Hellinger_distance
So in theory it's no worse or better than picking the K-L divergence, since all of these seem like valid starting points; however, it makes sense that this choice might be worth some further questioning.
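A quick numerical check of that identity in the commuting (diagonal) case, where the non-squared fidelity reduces to the Bhattacharyya coefficient:

```python
import numpy as np

# Two commuting (diagonal) states are just probability distributions.
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

# Non-squared fidelity of diagonal states = Bhattacharyya coefficient
fidelity = np.sum(np.sqrt(p * q))
# Squared Hellinger distance
hellinger_sq = 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

print(fidelity, 1.0 - hellinger_sq)   # identical up to rounding
```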
EDIT: in addition, due to the nature of the matrix logarithm, the quantum K-L divergence has some serious drawbacks. It's basically the equivalent of the classical one, actually - if the distribution in the denominator is ever zero where the numerator isn't, the divergence goes to infinity. In quantum terms, that happens if any one of the eigenvalues of the second density matrix is zero where the first has support. So I think it's possible that they saw this as simply not well-behaved enough to be worth using.
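A sketch of that failure mode, with the quantum relative entropy computed by hand from the eigendecompositions (my own implementation; the 0·log 0 = 0 convention and the infinite branch are the standard ones):

```python
import numpy as np

def quantum_kl(rho, sigma):
    # D(rho || sigma) = Tr[rho (log rho - log sigma)], computed from the
    # eigendecompositions, with the convention 0 * log 0 = 0. If rho has
    # support where sigma has none, the divergence is infinite.
    p, U = np.linalg.eigh(rho)
    q, V = np.linalg.eigh(sigma)
    W = np.abs(U.conj().T @ V) ** 2           # overlaps |<u_i|v_j>|^2
    out = np.sum(p[p > 1e-12] * np.log(p[p > 1e-12]))
    for i in range(len(p)):
        for j in range(len(q)):
            w = p[i] * W[i, j]
            if w < 1e-12:
                continue
            if q[j] < 1e-12:
                return np.inf                 # support mismatch: blows up
            out -= w * np.log(q[j])
    return out

rho = np.diag([0.5, 0.5]).astype(complex)
print(quantum_kl(rho, np.diag([0.8, 0.2]).astype(complex)))  # finite, ~0.223
print(quantum_kl(rho, np.diag([1.0, 0.0]).astype(complex)))  # inf
```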
No, I don't think there's anything like that. I do wonder about deriving the same result for the K-L divergence. I have no idea how hard that would be; it might even be quite easy. Possibly it even reduces to something more Bayes-like in the case of commuting operators. I'll try.
The title advertises a quantum version of Bayes' rule, but so far as I can tell the actual post never explicitly presents one. Am I missing something?
The actual formula is in the paper. I explained the process it is obtained from. The formula for the posterior looks quite abstruse, would have required me to explain more notation, and ultimately doesn't give any particularly useful intuitions on its face, so I omitted it. You can also find it in my code.
I think the title is fine. The post mostly reads, "if you want a quantum analogue, here's the path to take".
This post is an attempt to summarise and explain for the LW readership the contents of this paper: "Quantum Bayes' rule and Petz transpose map from the minimum change principle". It's a highly technical paper heavy on quantum mechanics formalism that took me a couple of days to unpack and digest a bit, but I think it may be important going forward. My work on it is far from done, but this is a quick introduction.
Epistemic status: I have a Physics PhD and spent about ten years working with computational quantum mechanics so hopefully I know what I'm talking about, but if anyone can peer review I'll be glad for the help.
The tagline of Astral Codex Ten reads:
P(A|B) = [P(A)*P(B|A)]/P(B), all the rest is commentary.
This sentence could very well exemplify the ethos of the rationalist community as a whole[1], but looking at it from a physics perspective, it misses something. Bayes' theorem is a statement about information - it tells us how to update previous knowledge (a distribution of probabilities over potential world-states) using newly acquired information to refine it. Yet, the way it defines the knowledge is classical. There are states, and there are finite, real probabilities (that sum to 1) attached to them.
We know the world not to be classical. The world is quantum. Going into details about what this implies would make this post considerably longer, but for the informational angle that we care about here I direct you to Scott Aaronson's excellent lecture on the topic and will only include a very brief summary:
This tells us that in fact having a complete quantum description of information is really important, and we may be missing some crucial elements if we don't. Nevertheless, until now, I had not really seen any satisfactory quantum equivalent of Bayes' theorem. Even the so-called QBism (Quantum Bayesianism) interpretation of quantum mechanics seemed to lack this element, and operated on a more qualitative level. While not everyone agrees and the philosophical debate still rages, it is at least reasonable (and often done) to consider the quantum state as a description of our knowledge about a system, rather than necessarily a true, physical thing. It is however strange that we don't really know how to update that knowledge freely the way we do with the classical kind.
This paper seems to remedy that. I'll go through the following steps to explain its contents:
As I said, this is still very much a WIP. Let me know if you spot any mistakes or want to contribute.
Suppose you have two systems, X and Y, each with a set of possible states, {x_1, ..., x_n} and {y_1, ..., y_m}.
We start with a probability distribution on X, the prior P(x), and we also know a conditional probability distribution, or likelihood (which is essentially a matrix), P(y|x). Now suppose we observe a certain state y: then how should we update our knowledge of X? Bayes' rule says:

P(x|y) = P(y|x) P(x) / P(y),    with P(y) = Σ_x P(y|x) P(x)
Now, asks the paper, suppose that instead of a definite value we observe, over a certain number of trials, a distribution of outcomes Q(y). How are we to generalise this rule to update our knowledge about X? A natural extension is:
P'(x) = Σ_y Q(y) P(x|y) = Σ_y Q(y) P(y|x) P(x) / P(y), of which the classic formulation is just the limit in which our Q(y) is 1 in one state and 0 everywhere else. You might notice that this is a bit like a stochastic or Markov process, in which Q(y) is the starting distribution and the posterior R(x|y) the transition matrix. This is the view taken by the paper - we consider the likelihood and the posterior probability to both be akin to processes which operate on one distribution to produce another[2]. So their approach to recovering Bayes' theorem is the following:
The full derivation follows below. Feel free to skip it if the math is too much; as long as you follow me on the logic above, that should be enough.
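Before the derivation, here is a quick numerical sketch of the soft-evidence update above, with toy numbers chosen to match the qubit test later in the post (uniform prior, a flip likelihood with p = 0.2, observed distribution (0.8, 0.2)):

```python
import numpy as np

# Toy numbers: uniform prior on X, a "flip" likelihood with p = 0.2,
# and soft evidence Q over Y observed as a frequency distribution.
P_x = np.array([0.5, 0.5])
P_yx = np.array([[0.8, 0.2],      # P(y|x): row x, column y
                 [0.2, 0.8]])
Q_y = np.array([0.8, 0.2])        # observed distribution of outcomes

P_y = P_x @ P_yx                              # P(y) = sum_x P(x) P(y|x)
P_x_given_y = (P_yx * P_x[:, None]) / P_y     # Bayes for each definite y
posterior = P_x_given_y @ Q_y                 # sum_y Q(y) P(x|y)
print(posterior)                              # [0.68 0.32]
```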
We define our joint distributions:

γ(x, y) = P(x) P(y|x),    τ(x, y) = Q(y) R(x|y)

where γ combines prior and likelihood (the forward process) and τ combines the observed distribution and the unknown posterior (the reverse process).
Here remember that, since we're trying to recover Bayes' rule, R(x|y) is our unknown quantity, which we're trying to retrieve through a variational principle. We try to minimise the Kullback-Leibler divergence:

D(τ||γ) = Σ_{x,y} τ(x, y) log[τ(x, y) / γ(x, y)]
Subject to a normalisation constraint:

Σ_x R(x|y) = 1 for every y
We can unify this problem by defining an objective function that makes use of Lagrange multipliers:

L = D(τ||γ) + Σ_y λ_y [Σ_x R(x|y) - 1]
To solve, we differentiate and solve for zero gradient:

∂L/∂R(x|y) = Q(y) {log[Q(y) R(x|y) / (P(x) P(y|x))] + 1} + λ_y = 0
∂L/∂λ_y = Σ_x R(x|y) - 1 = 0
Isolating R(x|y) in the first equation:

R(x|y) = [P(x) P(y|x) / Q(y)] exp(-1 - λ_y / Q(y))
Substituting in the second:

Σ_x R(x|y) = [P(y) / Q(y)] exp(-1 - λ_y / Q(y)) = 1, so exp(-1 - λ_y / Q(y)) = Q(y) / P(y)
Which then gives us back our Bayes' rule:

R(x|y) = P(y|x) P(x) / P(y)
QED.
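The variational argument is also easy to check numerically: minimising the KL divergence over a column-normalised R(x|y) lands on the Bayes posterior. A sketch (toy numbers, parametrisation and optimiser choice mine):

```python
import numpy as np
from scipy.optimize import minimize

# Toy numbers (mine): numerically minimise D(tau || gamma) over R(x|y)
# with normalised columns, and compare to the closed-form Bayes posterior.
P_x = np.array([0.6, 0.4])
P_yx = np.array([[0.7, 0.3],           # P(y|x), rows indexed by x
                 [0.2, 0.8]])
Q_y = np.array([0.5, 0.5])

gamma = P_x[:, None] * P_yx            # gamma(x, y) = P(x) P(y|x)

def kl(r_flat):
    R = np.clip(r_flat.reshape(2, 2), 1e-12, None)
    R = R / R.sum(axis=0, keepdims=True)    # constraint: sum_x R(x|y) = 1
    tau = R * Q_y[None, :]                  # tau(x, y) = Q(y) R(x|y)
    return float(np.sum(tau * np.log(tau / gamma)))

res = minimize(kl, x0=np.full(4, 0.25), method="Nelder-Mead",
               options={"maxiter": 20000, "xatol": 1e-10, "fatol": 1e-14})

R_opt = np.clip(res.x.reshape(2, 2), 1e-12, None)
R_opt = R_opt / R_opt.sum(axis=0, keepdims=True)
bayes = gamma / (P_x @ P_yx)           # R(x|y) = P(x) P(y|x) / P(y)
print(R_opt)
print(bayes)                           # the two agree
```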
Now comes the spicy part - how do we make this process into a quantum one? The process analogy is crucial: we know how to apply transformations to quantum systems! The most general form is called a "quantum channel", and it allows you to express any kind of transformation from one state to another, including irreversible ones, which could for example simulate interaction with an outside environment. The important constraint is that channels have to preserve the trace, so that real probabilities always keep summing to 1. This usually means some kind of time evolution, but the formalism doesn't require that. So we can establish a correspondence:
As long as we can express the joint probability distribution as a quantum density matrix, we can then apply some measure of distance between states; the one they use is called quantum fidelity, though their convention does not include the squaring that appears on Wikipedia. Given a prior density matrix, a quantum channel expressing the likelihood, and an observed end state, we can then maximise this fidelity (namely, minimise the distance) between the two joint probability distributions to find a "reversed" quantum channel that back-propagates the observed distribution into an updated knowledge of our system.
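As a concrete sketch of what "quantum channel" means here, this is a probabilistic bit flip in Kraus form, in plain NumPy rather than QuTiP; the completeness relation Σ_k K_k† K_k = I is what guarantees trace preservation:

```python
import numpy as np

# A probabilistic bit flip as a quantum channel, E(rho) = (1-p) rho + p X rho X,
# written in Kraus form. p = 0.2 matches the example used later in the post.
p = 0.2
X = np.array([[0, 1], [1, 0]], dtype=complex)
kraus = [np.sqrt(1 - p) * np.eye(2, dtype=complex), np.sqrt(p) * X]

def channel(rho):
    # E(rho) = sum_k K_k rho K_k^dag
    return sum(K @ rho @ K.conj().T for K in kraus)

# Completeness relation sum_k K_k^dag K_k = I <=> trace preservation
completeness = sum(K.conj().T @ K for K in kraus)
rho = np.array([[0.7, 0.2j], [-0.2j, 0.3]])    # an arbitrary valid state
print(np.allclose(completeness, np.eye(2)))    # True
print(np.trace(channel(rho)).real)             # ~1.0 (trace preserved)
```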
What happens in practice is that, given a quantum state on a Hilbert space, they "purify" it: they sort of duplicate it, so that we get a bigger state that contains two "copies" of it. Then we apply the quantum channel only to one of the two copies, which means we get a final state that is a joint description of both the starting point (the unaltered copy) and the end one (the copy the channel was applied to).
We can of course do the same in reverse if we know the final state instead; duplicate it, apply the backwards channel to one of the copies (if we want them to be comparable, it has to be the other one compared to the forward process) and get another joint quantum state out.
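Purification itself can be sketched in a few lines: expand the mixed state in its eigenbasis, attach an ancilla, and check that the partial trace recovers the original state. This is my own minimal construction, not the paper's:

```python
import numpy as np

# Purify a mixed qubit state rho on H_A into a pure state on H_A (x) H_A':
# |psi> = sum_i sqrt(w_i) |v_i>_A (x) |i>_A', with (w_i, |v_i>) the
# eigendecomposition of rho. Tracing out A' must give back rho.
rho = np.array([[0.8, 0.0], [0.0, 0.2]], dtype=complex)

w, V = np.linalg.eigh(rho)
basis = np.eye(2)
psi = sum(np.sqrt(w[i]) * np.kron(V[:, i], basis[i]) for i in range(2))

full = np.outer(psi, psi.conj())                             # pure state on A (x) A'
rho_back = full.reshape(2, 2, 2, 2).trace(axis1=1, axis2=3)  # trace out A'
print(np.allclose(rho_back, rho))                            # True
```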
We have these correspondences:
Element | Classical | Quantum |
---|---|---|
Prior | Distribution | Density matrix |
Likelihood | Conditional distribution | Quantum channel |
Observed information | Distribution | Density matrix |
Posterior | Conditional distribution | Quantum channel |
Then the "posterior channel" that maximises fidelity is found to be a rotated version of the Petz map, with extra unitary corrections that depend on the observed state τ (the explicit formula is in the paper, and in my code). In the case that τ and E(γ) commute (which will for example commonly happen if we take the prior to be uniform), this reduces to

R(σ) = γ^(1/2) E†( E(γ)^(-1/2) σ E(γ)^(-1/2) ) γ^(1/2)

a formula also known as the Petz transpose map, and which as a map is completely independent of τ.
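To check the commuting case concretely, here is a NumPy sketch of the Petz transpose map applied to the same qubit example as my QuTiP code (uniform prior, flip channel with p = 0.2, observed τ = diag(0.8, 0.2)); it reproduces the classical posterior (0.68, 0.32):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power as mpow

# Petz transpose map R(tau) = gamma^(1/2) E^dag( E(gamma)^(-1/2) tau
# E(gamma)^(-1/2) ) gamma^(1/2) for the qubit example: uniform prior,
# probabilistic flip with p = 0.2, observed tau = diag(0.8, 0.2).
p = 0.2
X = np.array([[0, 1], [1, 0]], dtype=complex)
kraus = [np.sqrt(1 - p) * np.eye(2, dtype=complex), np.sqrt(p) * X]

def E(rho):       # the forward channel
    return sum(K @ rho @ K.conj().T for K in kraus)

def E_adj(rho):   # its adjoint map
    return sum(K.conj().T @ rho @ K for K in kraus)

def petz(tau, gamma):
    Eg_inv_sqrt = mpow(E(gamma), -0.5)
    inner = Eg_inv_sqrt @ tau @ Eg_inv_sqrt
    g_sqrt = mpow(gamma, 0.5)
    return g_sqrt @ E_adj(inner) @ g_sqrt

gamma = np.eye(2, dtype=complex) / 2           # uniform prior
tau = np.diag([0.8, 0.2]).astype(complex)      # observed distribution
print(petz(tau, gamma).real)                   # diag(0.68, 0.32)
```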
Here is the code I wrote, using Python and the QuTiP library, for a quick test of this process on the simplest possible system (a single qubit subject to a probabilistic flip).
Here are a few outputs for very simple cases.
Uniform prior, decohered output
This is a purely classical case. We're starting with a uniform, decohered prior on the qubit, and after applying a spin flip with probability p = 0.2 we observe a fully classical state (probability 0.8 of up and 0.2 of down).
Starting gamma (prior):
[[0.5+0.j 0. +0.j]
[0. +0.j 0.5+0.j]]
Purified gamma, ptrace on A_2 (prior, not operated on):
[[0.5+0.j 0. +0.j]
[0. +0.j 0.5+0.j]]
Purified gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j 0. +0.j]
[0. +0.j 0.5+0.j]]
Processed gamma, ptrace on A_2 (posterior):
[[0.5+0.j 0. +0.j]
[0. +0.j 0.5+0.j]]
Processed gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j 0. +0.j]
[0. +0.j 0.5+0.j]]
Starting tau (observed distribution on Y):
[[0.8+0.j 0. +0.j]
[0. +0.j 0.2+0.j]]
Purified tau, ptrace on A_2 (observed distribution on Y):
[[0.8+0.j 0. +0.j]
[0. +0.j 0.2+0.j]]
Purified tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.8+0.j 0. +0.j]
[0. +0.j 0.2+0.j]]
Commutator [tau, E(gamma)] = 0.0
Processed tau, ptrace on A_2 (updated knowledge on X):
[[0.68+0.j 0. +0.j]
[0. +0.j 0.32+0.j]]
Processed tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.8+0.j 0. +0.j]
[0. +0.j 0.2+0.j]]
Fidelity: 0.9486833043041707
The result is as expected from the classical Bayes' theorem: the updated knowledge on X is P(up) = 0.68, P(down) = 0.32.
Coherent prior, output set to the observed state
This is a case of setting a coherent prior (the qubit is in a perfect up + down superposition) and then setting the observation to exactly match the output, which should retrieve the original state.
Starting gamma (prior):
[[0.5+0.j 0. -0.5j]
[0. +0.5j 0.5+0.j ]]
Purified gamma, ptrace on A_2 (prior, not operated on):
[[0.5+0.j 0. -0.5j]
[0. +0.5j 0.5+0.j ]]
Purified gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j 0. +0.5j]
[0. -0.5j 0.5+0.j ]]
Processed gamma, ptrace on A_2 (posterior):
[[0.5+0.j 0. -0.4j]
[0. +0.4j 0.5+0.j ]]
Processed gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j 0. +0.5j]
[0. -0.5j 0.5+0.j ]]
Starting tau (observed distribution on Y):
[[0.5+0.j 0. -0.4j]
[0. +0.4j 0.5+0.j ]]
Purified tau, ptrace on A_2 (observed distribution on Y):
[[0.5+0.j 0. -0.4j]
[0. +0.4j 0.5+0.j ]]
Purified tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.5+0.j 0. +0.4j]
[0. -0.4j 0.5+0.j ]]
Commutator [tau, E(gamma)] = 0.0
Processed tau, ptrace on A_2 (updated knowledge on X):
[[0.5+0.j 0. -0.5j]
[0. +0.5j 0.5+0.j ]]
Processed tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.5+0.j 0. +0.4j]
[0. -0.4j 0.5+0.j ]]
Fidelity: 0.9000000050662574
We see that this definitely does happen - the guess about X is correct. But the fidelity is not 1. This is not necessarily a contradiction - the matrices printed out here are merely "partial traces", which discard the fact that, this being a quantum description, there are correlations between the two subsystems that are expressed only in the full density matrix. So it's the correlations that are not printed out meaningfully here, but they are still expressed in the off-diagonal terms of the full density matrix and contribute to the non-perfect fidelity. I assume this is kind of like: if you start with a prior that your coin is fair before observing any throws, versus observing 10 throws that fall perfectly 5 heads and 5 tails, your assumption about the coin's nature is not going to change, but your belief distribution is, and that is what's entailed here. But there might be something more subtle I've missed.
I don't really have a good, impactful result to conclude this on. I wanted to put this post out quickly so I could enable others to also look at this and potentially contribute. My impression however is that there are some really interesting things to attempt here. One obvious thing that I might try next is quantum measurement - you can express that in terms of quantum channels, and "how do I do Bayesian inference back to the original state of a system given the classical outcomes of a quantum measurement" seems like an interesting thing to investigate that might offer some insights on the way our knowledge interacts with quantum systems.
Die-hard Bayesianism and an above average appreciation for obscure kabbalistic culture references.
If you want to think about them in linear algebra terms, since we're working with finite numbers of states: the distributions are vectors, and the likelihood and posterior are stochastic matrices that map one distribution vector to another.
I assume it's supposed to stand for "reversed".
They suggest multiple ones work, but I focused on the Kullback-Leibler divergence.