Somewhat related, in our recent preprint we showed how Bayesian updates work over quantum generators of stochastic processes. It's a different setup than the one you show here, but it does give a generalization of Bayes to the quantum and even post-quantum setting. We also show that the quantum (and post-quantum) Bayesian belief states are what transformers and other neural nets learn to represent during pre-training. This happens because to predict the future of a sequence optimally given some history of that sequence, the best you can do is to perform Bayes on the hidden latent states of the (sometimes quantum) generator of the data.
Read - thanks! Not digested in detail yet but enough to get the general outline. I didn't remember your name so was pleasantly surprised to realise this was a "sequel" to Transformers represent belief state geometry in their residual streams.
Focusing on the Bayesian part, if I understand correctly you do this on measurements, so it's kind of semiclassical? Basically you measure one quantity, get an outcome, then normalise and use that as your token, plus evolve the system with that operator. This would have to be an ensemble operation on prepared qubits in a real experiment, since the measured quantities aren't orthogonal. I guess I could try to check whether we recover your update formula for the case of discrete measurements, but I'm not sure how to go about it off the top of my head.
The "Classical derivation" made more sense to me after translating it to standard probability notation, so I'm commenting to share the "dictionary" I made for it, as well as the unexpected extra assumption I had to make.
The obvious: the prior becomes P(x), the likelihood becomes P(y|x), and an observed state becomes y.
It got tricky with Q(y). Instead of observing Y, we observe something else that gives us a probability distribution over Y. I considered this "something else" to be the value of some other unknown: Z. The probability distribution over Y is a conditional distribution: Q(y) = P(y|z).
Hate to have z on only one side of that equation like that... maybe I should have called it Q(y|z)... but I'll leave it as is.
Then, P(x|z) = Σ_y P(x|y,z) P(y|z).
Not quite the right formula for a simple interpretation of the update... if only P(x|y,z) = P(x|y).
This is conditional independence, which could be represented with this Bayes net: Z → Y → X.
Then, we have P(x|z) = Σ_y P(x|y) P(y|z) = Σ_y P(x|y) Q(y).
That completes the dictionary.
So to do what feels like ordinary probability theory, I had to introduce this extra unknown Z so that we have something to observe, and then to assume that Z only provides information about Y (and indirectly about X, through Y).
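To make this concrete, here is a small numerical sketch (toy numbers mine): calling the extra unknown Z, with the chain Z → Y → X, conditioning on z is just a Q(y)-weighted average of ordinary posteriors.

```python
import numpy as np

# Toy numbers (mine). Chain Z -> Y -> X: conditioning on z under
# P(x|y,z) = P(x|y) is a Q(y)-weighted average of ordinary posteriors.
P_x = np.array([0.5, 0.5])                 # prior P(x)
P_yx = np.array([[0.9, 0.1],               # P(y|x): row x, column y
                 [0.3, 0.7]])

P_y = P_x @ P_yx                           # marginal P(y)
P_x_given_y = (P_yx * P_x[:, None]) / P_y  # ordinary Bayes, P(x|y)

Q_y = np.array([0.8, 0.2])                 # Q(y) = P(y|z), the soft evidence
P_x_given_z = P_x_given_y @ Q_y            # sum_y P(x|y) P(y|z)
print(P_x_given_z, P_x_given_z.sum())      # a proper distribution over X
```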
The way you described Q(y) - as some probability distribution resulting from an observation, but not a conditional distribution - is in philosophy called Jeffrey conditionalization. The Stanford Encyclopedia of Philosophy gives this example:
A gambler is very confident that a certain racehorse, called Mudrunner, performs exceptionally well on muddy courses. A look at the extremely cloudy sky has an immediate effect on this gambler’s opinion: an increase in her credence in the proposition (muddy) that the course will be muddy—an increase without reaching certainty. Then this gambler raises her credence in the hypothesis (win) that Mudrunner will win the race, but nothing becomes fully certain. (Jeffrey 1965 [1983: sec. 11.3])
The idea is, we go from one probability distribution over Y to another, without becoming certain of anything. My introduction of Z corresponds to introducing an unknown representing the status of the sky. I would say we are conditioning on Z.
I recalled vaguely that Jaynes discussed Jeffrey conditionalization in Probability Theory, and criticized it for holding only in a special case. I took a look, and sure enough, it's in section 5.6, and he's pointing out exactly what I did, right down to the arrows, though he calls it a "logic flow diagram" rather than identifying it as a Pearl-style Bayes net.
I don't think you necessarily need a Z though. My interpretation of that step was "suppose we know X is a constant but hidden reality, and Y is observable. Then we perform experiments and measure the resulting Y, and thus characterise a probability distribution of it. How does that inform our guess on X?". And yeah, that could be mediated by a third variable, but it doesn't need to be. If X is "the coin is fair, or the coin is loaded to land 75% of the time on heads" and Y is "the result of a coin toss", you get a better (lower entropy) belief distribution on X after several tosses.
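That coin setup is easy to check numerically. A minimal sketch (tosses made up by me): updating the two-hypothesis belief toss by toss, the posterior entropy drops below the prior's one bit.

```python
import numpy as np

# X: "fair" vs "loaded (75% heads)"; Y: individual coin tosses (made up).
p_heads = np.array([0.5, 0.75])        # P(heads | x)
prior = np.array([0.5, 0.5])

def entropy(dist):
    return -np.sum(dist * np.log2(dist))

tosses = [1, 1, 0, 1, 1, 1, 0, 1]      # 1 = heads, 0 = tails

post = prior.copy()
for y in tosses:
    likelihood = p_heads if y == 1 else 1 - p_heads
    post = post * likelihood           # Bayes update, one toss at a time
    post = post / post.sum()

print(post)                            # the loaded coin is now favoured
print(entropy(prior), entropy(post))   # entropy dropped below 1 bit
```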
Thanks for this by the way, I used the paper's notation but agree it was a bit confusing so this probably helps people!
Yeah... well, I thought of the Z because it sounds like we're getting the probabilities of Y from some experiment. So z is the result of the experiment, which in this case is a vector of frequencies. When I put it like that, it sounds like Z is just a rhetorical device for saying that we have given probabilities of Y.
But I still seem to need Z for my dictionary. I have P(x|z). What is P(x|z)? It is some kind of updated probability of X, right? Like we went from one probability to the other by doing an experiment. If I didn't write the z, I'd need something like P_before(x) and P_after(x).
Reading again, it seems like this is exactly Jeffrey conditionalization. So whether you include some extra variable just depends on what you think of Jeffrey conditionalization.
I feel like I'm missing something, though, about what this experiment is and means. For example, I'm not totally clear on whether we have one state X, and a collection of replicates of state Y; or is it a collection of replicates of (X, Y) pairs?
Looking at the paper, I see the connection to Jeffrey conditionalization is made explicitly. And it mentions Pearl's "virtual evidence method"; is this what he calls introducing this Z? But there's no clarity on exactly what this experiment is. It just says:
But how should the above be generalized to the situation where the new information does not come in the form of a definite value for Y, but as “soft evidence,” i.e., a probability distribution Q(y)?
By the way, regarding your coin toss example, I can at least say how this is handled in Bayesian statistics. There are separate random variables for each coin toss: Y_1 is the first, Y_2 is the second, etc. If you have n coin tosses, then your sample is a vector containing Y_1 to Y_n. Then the posterior probability is P(x | y_1, ..., y_n). This will be covered in any Bayesian statistics textbook as "the Bernoulli model". My class used Hoff's book, which provides a quick start.
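A minimal sketch of that Bernoulli model with a conjugate Beta prior (toy tosses; the choice of Beta(1, 1), i.e., the uniform prior, is mine):

```python
import numpy as np

# Toy data: 1 = heads. With a Beta(a0, b0) prior on the heads
# probability theta, the posterior after the tosses y_1..y_n is
# Beta(a0 + #heads, b0 + #tails): the conjugate update.
y = np.array([1, 1, 0, 1, 0, 1, 1, 1])

a0, b0 = 1.0, 1.0                      # uniform prior
a_post = a0 + y.sum()
b_post = b0 + len(y) - y.sum()
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, posterior_mean)  # 7.0 3.0 0.7
```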
I guess this example suggests a single unknown (whether the coin is loaded or not) and replicates of .
Yes, I'm aware of the Bernoulli model - my point is that the vector of outcomes is itself the outcome of that experiment; I suppose you can call it z, though that makes the notation a bit confusing. The general point is that yes, you update your belief about X based on a series of outcomes of Y. In fact I think in general P(x | y_1, ..., y_n) works just fine.
Is there a reason they switched from divergence to fidelity when going quantum? You should want to get the classical Bayes' rule in the limit as your density matrices become classical, and fidelity definitely doesn't give you that.
Quoting from the paper:
Fidelity is one of the most natural measures of the closeness between quantum states and has found countless applications in quantum information theory.
I agree that this sort of quantum relative entropy should also be doable. It's possible that the result would be the same. I guess an easy check would be to perturb the posterior and check whether this measure also has a minimum around the same point.
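For the classical case that perturbation check is easy to run. Assuming the classical counterpart of their (non-squared) fidelity is the Bhattacharyya overlap Σ√(γτ) between the two joint distributions, perturbing the posterior away from Bayes only ever lowers it (toy numbers chosen to match the first qubit test):

```python
import numpy as np

# Toy numbers matching the post's first qubit test. The classical analogue
# of the (non-squared) fidelity between the two joints is the
# Bhattacharyya overlap sum_{x,y} sqrt(gamma(x,y) tau(x,y)).
P_x = np.array([0.5, 0.5])
P_yx = np.array([[0.8, 0.2], [0.2, 0.8]])        # P(y|x)
Q_y = np.array([0.8, 0.2])                       # observed distribution

P_y = P_x @ P_yx
gamma = P_x[:, None] * P_yx                      # gamma(x, y) = P(x) P(y|x)
bayes = (P_yx * P_x[:, None]).T / P_y[:, None]   # R(x|y), rows indexed by y

def fidelity(R):
    tau = Q_y[:, None] * R                       # tau(y, x) = Q(y) R(x|y)
    return np.sum(np.sqrt(gamma.T * tau))

best = fidelity(bayes)
rng = np.random.default_rng(0)
for _ in range(1000):
    R = np.clip(bayes + rng.normal(scale=0.05, size=bayes.shape), 1e-12, None)
    R /= R.sum(axis=1, keepdims=True)            # keep each row normalised
    assert fidelity(R) <= best + 1e-12           # Bayes is never beaten
print(best)   # ~0.94868
```

Note that the maximum value here agrees, up to numerical error, with the fidelity printed by the first qubit test in the post, which is the same classical situation.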
Yeah, that was about the only sentence I read in the paper. I was wondering if you'd seen a theoretical justification (logos) rather than just an ethical appeal (ethos), but didn't want to comb through the maths myself. By the way, fidelity won't give the same posterior. I haven't worked through the maths whatsoever, but I'd still put >95% probability on this claim.
So, to add to this:
they may have chosen this route because it turns out that taking the derivative of a matrix logarithm, without certain guarantees of commutativity of the matrix with its own differential, is really, really hard. Which, to be fair, isn't a good reason per se, but yeah.
Also, the paper mentions that
the Kullback–Leibler divergence [7, 10], other f-divergences including Pearson divergence and Hellinger distance [34], zero-one loss [35], or the mean-square error of an estimation [36, 37]
and looking at it, the quantum fidelity reduces to one minus the Hellinger distance squared:
https://en.wikipedia.org/wiki/Hellinger_distance
So in theory it's no worse or better than picking the K-L divergence, since all of these seem like valid starting points; however, it makes sense that this choice might be worth some further questioning.
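A quick numerical check of that identity in the commuting (diagonal) case, where the non-squared fidelity reduces to the Bhattacharyya coefficient:

```python
import numpy as np

# Two commuting (diagonal) states are just probability distributions.
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

# Non-squared fidelity of diagonal states = Bhattacharyya coefficient
fidelity = np.sum(np.sqrt(p * q))
# Squared Hellinger distance
hellinger_sq = 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

print(fidelity, 1.0 - hellinger_sq)   # identical up to rounding
```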
EDIT: in addition, due to the nature of the matrix logarithm, the quantum K-L divergence has some serious drawbacks. It's basically the equivalent of the classical one, actually - if the distribution in the denominator is ever zero where the numerator isn't, the divergence goes to infinity. In quantum terms, that happens if any one of the eigenvalues of the second density matrix is zero where the first has support. So I think it's possible that they saw this as simply not well-behaved enough to be worth using.
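A sketch of that failure mode, with the quantum relative entropy computed by hand from the eigendecompositions (my own implementation; the 0·log 0 = 0 convention and the infinite branch are the standard ones):

```python
import numpy as np

def quantum_kl(rho, sigma):
    # D(rho || sigma) = Tr[rho (log rho - log sigma)], computed from the
    # eigendecompositions, with the convention 0 * log 0 = 0. If rho has
    # support where sigma has none, the divergence is infinite.
    p, U = np.linalg.eigh(rho)
    q, V = np.linalg.eigh(sigma)
    W = np.abs(U.conj().T @ V) ** 2           # overlaps |<u_i|v_j>|^2
    out = np.sum(p[p > 1e-12] * np.log(p[p > 1e-12]))
    for i in range(len(p)):
        for j in range(len(q)):
            w = p[i] * W[i, j]
            if w < 1e-12:
                continue
            if q[j] < 1e-12:
                return np.inf                 # support mismatch: blows up
            out -= w * np.log(q[j])
    return out

rho = np.diag([0.5, 0.5]).astype(complex)
print(quantum_kl(rho, np.diag([0.8, 0.2]).astype(complex)))  # finite, ~0.223
print(quantum_kl(rho, np.diag([1.0, 0.0]).astype(complex)))  # inf
```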
No, I don't think there's anything like that. I do wonder about deriving the same result for the K-L divergence. I have no idea how hard that would be; it might even be quite easy. Possibly it even reduces to something more Bayes-like in the case of commuting operators. I'll try.
The title advertises a quantum version of Bayes' rule, but so far as I can tell the actual post never explicitly presents one. Am I missing something?
The actual formula is in the paper. I explained the process it is obtained from. The formula for the posterior looks quite abstruse, would have required me to explain more notation, and ultimately doesn't give any particularly useful intuitions on its face, so I omitted it. You can also find it in my code.
I think the title is fine. The post mostly reads, "if you want a quantum analogue, here's the path to take".
This post is an attempt to summarise and explain for the LW readership the contents of this paper: "Quantum Bayes' rule and Petz transpose map from the minimum change principle". It's a highly technical paper heavy on quantum mechanics formalism that took me a couple of days to unpack and digest a bit, but I think it may be important going forward. My work on it is far from done, but this is a quick introduction.
Epistemic status: I have a Physics PhD and spent about ten years working with computational quantum mechanics so hopefully I know what I'm talking about, but if anyone can peer review I'll be glad for the help.
The tagline of Astral Codex Ten reads:
P(A|B) = [P(A)*P(B|A)]/P(B), all the rest is commentary.
This sentence could very well exemplify the ethos of the rationalist community as a whole[1], but looking at it from a physics perspective, it misses something. Bayes' theorem is a statement about information - it tells us how to update previous knowledge (a distribution of probabilities over potential world-states) using newly acquired information to refine it. Yet, the way it defines the knowledge is classical. There are states, and there are finite, real probabilities (that sum to 1) attached to them.
We know the world not to be classical. The world is quantum. Going into details about what this implies would make this post considerably longer, but for the informational angle that we care about here I direct you to Scott Aaronson's excellent lecture on the topic and will only include a very brief summary:
This tells us that in fact having a complete quantum description of information is really important, and we may be missing some crucial elements if we don't. Nevertheless, until now, I had not really seen any satisfactory quantum equivalent of Bayes' theorem. Even the so-called QBism (Quantum Bayesianism) interpretation of quantum mechanics seemed to lack this element, and operated on a more qualitative level. While not everyone agrees and the philosophical debate still rages, it is at least reasonable (and often done) to consider the quantum state as a description of our knowledge about a system, rather than necessarily a true, physical thing. It is however strange that we don't really know how to update that knowledge freely the way we do with the classical kind.
This paper seems to remedy that. I'll go through the following steps to explain its contents:
As I said, this is still very much a WIP. Let me know if you spot any mistakes or want to contribute.
Suppose you have two systems, X and Y, each with a set of possible states, {x_1, ..., x_n} and {y_1, ..., y_m}.
We start with a probability distribution on X, the prior P(x), and we also know a conditional probability distribution, or likelihood (which is essentially a matrix), P(y|x). Now suppose we observe a certain state y: then how should we update our knowledge of X? Bayes' rule says:

P(x|y) = P(y|x) P(x) / P(y),    with P(y) = Σ_x P(y|x) P(x)
Now, asks the paper, suppose that instead of a definite value we observe, over a certain number of trials, a distribution of outcomes Q(y). How are we to generalise this rule to update our knowledge about X? A natural extension is:
P'(x) = Σ_y Q(y) P(x|y) = Σ_y Q(y) P(y|x) P(x) / P(y), of which the classic formulation is just the limit in which our Q(y) is 1 in one state and 0 everywhere else. You might notice that this is a bit like a stochastic or Markov process, in which Q(y) is the starting distribution and the posterior R(x|y) the transition matrix. This is the view taken by the paper - we consider the likelihood and the posterior probability to both be akin to processes which operate on one distribution to produce another[2]. So their approach to recovering Bayes' theorem is the following:
The full derivation follows below. Feel free to skip it if the math is too much; as long as you follow me on the logic above, that should be enough.
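Before the derivation, here is a quick numerical sketch of the soft-evidence update above, with toy numbers chosen to match the qubit test later in the post (uniform prior, a flip likelihood with p = 0.2, observed distribution (0.8, 0.2)):

```python
import numpy as np

# Toy numbers: uniform prior on X, a "flip" likelihood with p = 0.2,
# and soft evidence Q over Y observed as a frequency distribution.
P_x = np.array([0.5, 0.5])
P_yx = np.array([[0.8, 0.2],      # P(y|x): row x, column y
                 [0.2, 0.8]])
Q_y = np.array([0.8, 0.2])        # observed distribution of outcomes

P_y = P_x @ P_yx                              # P(y) = sum_x P(x) P(y|x)
P_x_given_y = (P_yx * P_x[:, None]) / P_y     # Bayes for each definite y
posterior = P_x_given_y @ Q_y                 # sum_y Q(y) P(x|y)
print(posterior)                              # [0.68 0.32]
```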
We define our joint distributions:

γ(x, y) = P(x) P(y|x),    τ(x, y) = Q(y) R(x|y)

where γ combines prior and likelihood (the forward process) and τ combines the observed distribution and the unknown posterior (the reverse process).
Here remember that, since we're trying to recover Bayes' rule, R(x|y) is our unknown quantity, which we're trying to retrieve through a variational principle. We try to minimise the Kullback-Leibler divergence:

D(τ||γ) = Σ_{x,y} τ(x, y) log[τ(x, y) / γ(x, y)]
Subject to a normalisation constraint:

Σ_x R(x|y) = 1 for every y
We can unify this problem by defining an objective function that makes use of Lagrange multipliers:

L = D(τ||γ) + Σ_y λ_y [Σ_x R(x|y) - 1]
To solve, we differentiate and solve for zero gradient:

∂L/∂R(x|y) = Q(y) {log[Q(y) R(x|y) / (P(x) P(y|x))] + 1} + λ_y = 0
∂L/∂λ_y = Σ_x R(x|y) - 1 = 0
Isolating R(x|y) in the first equation:

R(x|y) = [P(x) P(y|x) / Q(y)] exp(-1 - λ_y / Q(y))
Substituting in the second:

Σ_x R(x|y) = [P(y) / Q(y)] exp(-1 - λ_y / Q(y)) = 1, so exp(-1 - λ_y / Q(y)) = Q(y) / P(y)
Which then gives us back our Bayes' rule:

R(x|y) = P(y|x) P(x) / P(y)
QED.
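The variational argument is also easy to check numerically: minimising the KL divergence over a column-normalised R(x|y) lands on the Bayes posterior. A sketch (toy numbers, parametrisation and optimiser choice mine):

```python
import numpy as np
from scipy.optimize import minimize

# Toy numbers (mine): numerically minimise D(tau || gamma) over R(x|y)
# with normalised columns, and compare to the closed-form Bayes posterior.
P_x = np.array([0.6, 0.4])
P_yx = np.array([[0.7, 0.3],           # P(y|x), rows indexed by x
                 [0.2, 0.8]])
Q_y = np.array([0.5, 0.5])

gamma = P_x[:, None] * P_yx            # gamma(x, y) = P(x) P(y|x)

def kl(r_flat):
    R = np.clip(r_flat.reshape(2, 2), 1e-12, None)
    R = R / R.sum(axis=0, keepdims=True)    # constraint: sum_x R(x|y) = 1
    tau = R * Q_y[None, :]                  # tau(x, y) = Q(y) R(x|y)
    return float(np.sum(tau * np.log(tau / gamma)))

res = minimize(kl, x0=np.full(4, 0.25), method="Nelder-Mead",
               options={"maxiter": 20000, "xatol": 1e-10, "fatol": 1e-14})

R_opt = np.clip(res.x.reshape(2, 2), 1e-12, None)
R_opt = R_opt / R_opt.sum(axis=0, keepdims=True)
bayes = gamma / (P_x @ P_yx)           # R(x|y) = P(x) P(y|x) / P(y)
print(R_opt)
print(bayes)                           # the two agree
```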
Now comes the spicy part - how do we make this process into a quantum one? The process analogy is crucial: we know how to apply transformations to quantum systems! The most general form is called a "quantum channel", and it allows you to express any kind of transformation from one state to another, including irreversible ones, which could for example simulate interaction with an outside environment. The important constraint is that channels have to preserve the trace, so that real probabilities always keep summing to 1. This usually means some kind of time evolution, but the formalism doesn't require that. So we can establish a correspondence:
As long as we can express the joint probability distribution as a quantum density matrix, we can then apply some measure of distance between states; the one they use is called quantum fidelity, though their convention does not include the squaring that appears on Wikipedia. Given a prior density matrix, a quantum channel expressing the likelihood, and an observed end state, we can then maximise this fidelity (namely, minimise the distance) between the two joint probability distributions to find a "reversed" quantum channel that back-propagates the observed distribution into an updated knowledge of our system.
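As a concrete sketch of what "quantum channel" means here, this is a probabilistic bit flip in Kraus form, in plain NumPy rather than QuTiP; the completeness relation Σ_k K_k† K_k = I is what guarantees trace preservation:

```python
import numpy as np

# A probabilistic bit flip as a quantum channel, E(rho) = (1-p) rho + p X rho X,
# written in Kraus form. p = 0.2 matches the example used later in the post.
p = 0.2
X = np.array([[0, 1], [1, 0]], dtype=complex)
kraus = [np.sqrt(1 - p) * np.eye(2, dtype=complex), np.sqrt(p) * X]

def channel(rho):
    # E(rho) = sum_k K_k rho K_k^dag
    return sum(K @ rho @ K.conj().T for K in kraus)

# Completeness relation sum_k K_k^dag K_k = I <=> trace preservation
completeness = sum(K.conj().T @ K for K in kraus)
rho = np.array([[0.7, 0.2j], [-0.2j, 0.3]])    # an arbitrary valid state
print(np.allclose(completeness, np.eye(2)))    # True
print(np.trace(channel(rho)).real)             # ~1.0 (trace preserved)
```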
What happens in practice is that, given a quantum state on a Hilbert space, they "purify" it: they sort of duplicate it, so that we get a bigger state that contains two "copies" of it. Then we apply the quantum channel only to one of the two copies, which means we get a final state that is a joint description of both the starting point (the unaltered copy) and the end one (the copy the channel was applied to).
We can of course do the same in reverse if we know the final state instead; duplicate it, apply the backwards channel to one of the copies (if we want them to be comparable, it has to be the other one compared to the forward process) and get another joint quantum state out.
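Purification itself can be sketched in a few lines: expand the mixed state in its eigenbasis, attach an ancilla, and check that the partial trace recovers the original state. This is my own minimal construction, not the paper's:

```python
import numpy as np

# Purify a mixed qubit state rho on H_A into a pure state on H_A (x) H_A':
# |psi> = sum_i sqrt(w_i) |v_i>_A (x) |i>_A', with (w_i, |v_i>) the
# eigendecomposition of rho. Tracing out A' must give back rho.
rho = np.array([[0.8, 0.0], [0.0, 0.2]], dtype=complex)

w, V = np.linalg.eigh(rho)
basis = np.eye(2)
psi = sum(np.sqrt(w[i]) * np.kron(V[:, i], basis[i]) for i in range(2))

full = np.outer(psi, psi.conj())                             # pure state on A (x) A'
rho_back = full.reshape(2, 2, 2, 2).trace(axis1=1, axis2=3)  # trace out A'
print(np.allclose(rho_back, rho))                            # True
```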
We have these correspondences:
Element | Classical | Quantum |
---|---|---|
Prior | Distribution | Density matrix |
Likelihood | Conditional distribution | Quantum channel |
Observed information | Distribution | Density matrix |
Posterior | Conditional distribution | Quantum channel |
Then the "posterior channel" that maximises fidelity is found to be a rotated version of the Petz map, with extra unitary corrections that depend on the observed state τ (the explicit formula is in the paper, and in my code). In the case that τ and E(γ) commute (which will for example commonly happen if we take the prior to be uniform), this reduces to

R(σ) = γ^(1/2) E†( E(γ)^(-1/2) σ E(γ)^(-1/2) ) γ^(1/2)

a formula also known as the Petz transpose map, and which as a map is completely independent of τ.
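To check the commuting case concretely, here is a NumPy sketch of the Petz transpose map applied to the same qubit example as my QuTiP code (uniform prior, flip channel with p = 0.2, observed τ = diag(0.8, 0.2)); it reproduces the classical posterior (0.68, 0.32):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power as mpow

# Petz transpose map R(tau) = gamma^(1/2) E^dag( E(gamma)^(-1/2) tau
# E(gamma)^(-1/2) ) gamma^(1/2) for the qubit example: uniform prior,
# probabilistic flip with p = 0.2, observed tau = diag(0.8, 0.2).
p = 0.2
X = np.array([[0, 1], [1, 0]], dtype=complex)
kraus = [np.sqrt(1 - p) * np.eye(2, dtype=complex), np.sqrt(p) * X]

def E(rho):       # the forward channel
    return sum(K @ rho @ K.conj().T for K in kraus)

def E_adj(rho):   # its adjoint map
    return sum(K.conj().T @ rho @ K for K in kraus)

def petz(tau, gamma):
    Eg_inv_sqrt = mpow(E(gamma), -0.5)
    inner = Eg_inv_sqrt @ tau @ Eg_inv_sqrt
    g_sqrt = mpow(gamma, 0.5)
    return g_sqrt @ E_adj(inner) @ g_sqrt

gamma = np.eye(2, dtype=complex) / 2           # uniform prior
tau = np.diag([0.8, 0.2]).astype(complex)      # observed distribution
print(petz(tau, gamma).real)                   # diag(0.68, 0.32)
```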
Here is the code I wrote, using Python and the QuTiP library, for a quick test of this process on the simplest possible system (a single qubit subject to a probabilistic flip).
Here are a few outputs for very simple cases.
Uniform prior, decohered output
This is a purely classical case. We're starting with a uniform, decohered prior on the qubit, and after applying a spin flip with probability p = 0.2 we observe a fully classical state (probability 0.8 of up and 0.2 of down).
Starting gamma (prior):
[[0.5+0.j 0. +0.j]
[0. +0.j 0.5+0.j]]
Purified gamma, ptrace on A_2 (prior, not operated on):
[[0.5+0.j 0. +0.j]
[0. +0.j 0.5+0.j]]
Purified gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j 0. +0.j]
[0. +0.j 0.5+0.j]]
Processed gamma, ptrace on A_2 (posterior):
[[0.5+0.j 0. +0.j]
[0. +0.j 0.5+0.j]]
Processed gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j 0. +0.j]
[0. +0.j 0.5+0.j]]
Starting tau (observed distribution on Y):
[[0.8+0.j 0. +0.j]
[0. +0.j 0.2+0.j]]
Purified tau, ptrace on A_2 (observed distribution on Y):
[[0.8+0.j 0. +0.j]
[0. +0.j 0.2+0.j]]
Purified tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.8+0.j 0. +0.j]
[0. +0.j 0.2+0.j]]
Commutator [tau, E(gamma)] = 0.0
Processed tau, ptrace on A_2 (updated knowledge on X):
[[0.68+0.j 0. +0.j]
[0. +0.j 0.32+0.j]]
Processed tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.8+0.j 0. +0.j]
[0. +0.j 0.2+0.j]]
Fidelity: 0.9486833043041707
The result is as expected from the classical Bayes' theorem: the updated knowledge on X is P(up) = 0.68, P(down) = 0.32.
Coherent prior, output set to the observed state
This is a case of setting a coherent prior (the qubit is in a perfect up + down superposition) and then setting the observation to exactly match the output, which should retrieve the original state.
Starting gamma (prior):
[[0.5+0.j 0. -0.5j]
[0. +0.5j 0.5+0.j ]]
Purified gamma, ptrace on A_2 (prior, not operated on):
[[0.5+0.j 0. -0.5j]
[0. +0.5j 0.5+0.j ]]
Purified gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j 0. +0.5j]
[0. -0.5j 0.5+0.j ]]
Processed gamma, ptrace on A_2 (posterior):
[[0.5+0.j 0. -0.4j]
[0. +0.4j 0.5+0.j ]]
Processed gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j 0. +0.5j]
[0. -0.5j 0.5+0.j ]]
Starting tau (observed distribution on Y):
[[0.5+0.j 0. -0.4j]
[0. +0.4j 0.5+0.j ]]
Purified tau, ptrace on A_2 (observed distribution on Y):
[[0.5+0.j 0. -0.4j]
[0. +0.4j 0.5+0.j ]]
Purified tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.5+0.j 0. +0.4j]
[0. -0.4j 0.5+0.j ]]
Commutator [tau, E(gamma)] = 0.0
Processed tau, ptrace on A_2 (updated knowledge on X):
[[0.5+0.j 0. -0.5j]
[0. +0.5j 0.5+0.j ]]
Processed tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.5+0.j 0. +0.4j]
[0. -0.4j 0.5+0.j ]]
Fidelity: 0.9000000050662574
We see that this definitely does happen - the guess about X is correct. But the fidelity is not 1. This is not necessarily a contradiction - the matrices printed out here are merely "partial traces", which discard the fact that, this being a quantum description, there are correlations between the two subsystems that are expressed only in the full density matrix. So it's the correlations that are not printed out meaningfully here, but they are still expressed in the off-diagonal terms of the full density matrix and contribute to the non-perfect fidelity. I assume this is kind of like: if you start with a prior that your coin is fair before observing any throws, versus observing 10 throws that fall perfectly 5 heads and 5 tails, your assumption about the coin's nature is not going to change, but your belief distribution is, and that is what's entailed here. But there might be something more subtle I've missed.
I don't really have a good, impactful result to conclude this on. I wanted to put this post out quickly so I could enable others to also look at this and potentially contribute. My impression however is that there are some really interesting things to attempt here. One obvious thing that I might try next is quantum measurement - you can express that in terms of quantum channels, and "how do I do Bayesian inference back to the original state of a system given the classical outcomes of a quantum measurement" seems like an interesting thing to investigate that might offer some insights on the way our knowledge interacts with quantum systems.
Die-hard Bayesianism and an above average appreciation for obscure kabbalistic culture references.
If you want to think about them in linear algebra terms, since we're working with finite numbers of states: the distributions are vectors, and the likelihood and posterior are stochastic matrices that map one distribution vector to another.
I assume it's supposed to stand for "reversed".
They suggest multiple ones work, but I focused on the Kullback-Leibler divergence.