Garrett Baker

Independent alignment researcher

40

My reading is that their definition of conditional predictive entropy is the naive generalization of Shannon's conditional entropy to the case where the way you condition on data is restricted to functions of a particular class. The corresponding generalization of mutual information then measures how much more predictable some piece of information (Y) becomes given evidence (X), compared to no evidence.

For example, the goal of public key cryptography cannot be to make the mutual information between the plaintext and the pair of public key & encrypted text zero, while maintaining maximal mutual information between the encrypted text and plaintext given the private key, since this is impossible: an observer with unbounded computation could in principle recover the plaintext from the public key & encrypted text alone.

Cryptography instead assumes everyone involved can only condition their probability distributions using polynomial-time algorithms applied to the data they have, and in that circumstance you can minimize the predictability of your plaintext after getting the public key & encrypted text, while maximizing the predictability of the plaintext after getting the private key & encrypted text.

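More mathematically, they assume you can only implement functions from your data to your conditioned probability distributions in the set of functions $\mathcal{V}$, with the property that for any possible probability distribution you are able to output given the right set of data, you also have the choice of simply outputting the probability distribution without looking at the data. In other words, if you can represent it, you can output it. This corresponds to equation (1).

The Shannon entropy of a random variable $Y$ given $X$ is

$$H(Y \mid X) = \mathbb{E}_{x, y \sim X, Y}\left[-\log P(y \mid x)\right]$$

Thus, the predictive entropy of a random variable $Y$ given $X$, only being able to condition using functions in $\mathcal{V}$, would be

$$H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \mathcal{V}} \mathbb{E}_{x, y \sim X, Y}\left[-\log f[x](y)\right]$$

Where $f[x]$ is the probability distribution over $Y$ that $f$ outputs on seeing $x$ (with $f[\varnothing]$ its output when given no data), and $f[x](y)$ is the probability it assigns to $y$, if we'd like to use the notation of the paper.

And using this we can define predictive information, which as said before answers the question "how much more predictable is Y after we get the information X compared to no information?" by

$$I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y \mid \varnothing) - H_{\mathcal{V}}(Y \mid X)$$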

which they also show can be empirically well estimated by the naive data sampling method (i.e. replacing the expectations in definition 2 with empirical samples).
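To make the plug-in estimate concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's code: the function class is a toy one (predictors that may only look at a chosen `feature` of x, or ignore the data entirely), the data are synthetic, and the names `v_information`, `predictive_entropy`, and `feature` are made up for this example. For this toy class the infimum over predictors is attained by the empirical conditional frequencies, so no optimizer is needed.

```python
import math
import random
from collections import Counter, defaultdict


def empirical_entropy(ys):
    """Log loss (in bits) of the best constant predictor on the sample.

    This plays the role of the predictive entropy given no data.
    """
    counts = Counter(ys)
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def predictive_entropy(xs, ys, feature):
    """Empirical predictive entropy of Y given X for a toy function class.

    Assumed class: predictors that see only feature(x). Within this class
    the log loss is minimized by the empirical frequencies of y within
    each group of samples sharing the same feature value.
    """
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[feature(x)].append(y)
    n = len(ys)
    return sum(len(g) / n * empirical_entropy(g) for g in groups.values())


def v_information(xs, ys, feature):
    """Predictive information: entropy with no data minus entropy given X."""
    return empirical_entropy(ys) - predictive_entropy(xs, ys, feature)


# Synthetic data: X is uniform on {0, 1, 2, 3}, Y is the low bit of X.
random.seed(0)
xs = [random.randrange(4) for _ in range(10_000)]
ys = [x % 2 for x in xs]

# An observer who can use all of x extracts roughly one bit about Y.
info_full = v_information(xs, ys, feature=lambda x: x)

# An observer restricted to the high bit of x extracts roughly nothing,
# even though x determines y exactly.
info_high_bit = v_information(xs, ys, feature=lambda x: x // 2)
```

The two observers see the same data, but only the less restricted one can use it, which is the same shape as the cryptography example: the information is present in the Shannon sense, yet not usable by the constrained class.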

20

Who is updating? I haven't seen anyone change their mind yet.

A Theory of Usable Information Under Computational Constraints

We propose a new framework for reasoning about information in complex systems. Our foundation is based on a variational extension of Shannon's information theory that takes into account the modeling power and computational constraints of the observer. The resulting predictive V-information encompasses mutual information and other notions of informativeness such as the coefficient of determination. Unlike Shannon's mutual information and in violation of the data processing inequality, V-information can be created through computation. This is consistent with deep neural networks extracting hierarchies of progressively more informative features in representation learning. Additionally, we show that by incorporating computational constraints, V-information can be reliably estimated from data even in high dimensions with PAC-style guarantees. Empirically, we demonstrate predictive V-information is more effective than mutual information for structure learning and fair representation learning.

h/t Simon Pepin Lehalleur

62

You may be interested in this if you haven’t seen it already: Robust Agents Learn Causal World Models (DM):

It has long been hypothesised that causal reasoning plays a fundamental role in robust and general intelligence. However, it is not known if agents must learn causal models in order to generalise to new domains, or if other inductive biases are sufficient. We answer this question, showing that any agent capable of satisfying a regret bound under a large set of distributional shifts must have learned an approximate causal model of the data generating process, which converges to the true causal model for optimal agents. We discuss the implications of this result for several research areas including transfer learning and causal inference.

h/t Gwern

50

Ah yes, another contrarian opinion I have:

- Big AGI corporations, like Anthropic, should by default make much of their AGI alignment research private, and not share it with competing labs. Why? So it can remain a private good, and on the off chance such research can be expected to be profitable, those labs & investors can be rewarded for that research.

31

When I accuse someone of overconfidence, I usually mean they're being too hedgehogy when they should be being more foxy.

20

Example 2: People generally seem to have an opinion of "chain-of-thought allows the model to do multiple steps of reasoning". Garrett seemed to have a quite different perspective, something like "chain-of-thought is much more about clarifying the situation, collecting one's thoughts and getting the right persona activated, not about doing useful serial computational steps". Cases like "perform long division" are the exception, not the rule. But people seem to be quite hand-wavy about this, and don't e.g. do causal interventions to check that the CoT actually matters for the result. (Indeed, often interventions don't affect the final result.)

I will clarify on this. I think people often do causal interventions in their CoTs, but not in ways that are very convincing to me.

20

I'm skeptical a humanities education doesn't show up in earnings. In their book Talent, Daniel Gross and Tyler Cowen argue that a common theme they personally see among very successful talent scouts is the ability to "speak different cultural languages", which they also claim is helped along by being widely read in the humanities.

I expect it matters a lot less whether this is an autodidactic thing or a school thing, and plausibly autodidactic humanities is better suited for this particular benefit than school-learned humanities, since when reading a text on your own you can truly inhabit the world of the writer, whereas in school you need to constantly tie that world back into acceptable 12 pt font, double-spaced, Times New Roman, MLA-formatted academic standards. And of course, in such an environment there are a host of thoughts you cannot think or argue for, and in some corners the conclusions you reach are all but written at the bottom of your paper for you.

Edit: I'll also note that I like Tyler Cowen's perspective on the state of humanities education among the populace, which he argues is at an all-time high, no thanks to higher education pushing it. Why? Because there is more discussion & more accessible discussion than ever before about all the classics in every field of creative endeavor (indeed, such documents are often freely accessible on Project Gutenberg, and the music & plays on YouTube), more & perhaps more interesting philosophy than ever before, and more universal access to histories and historical documents than there ever was in the past. The humanities are at an all-time high thanks to the internet. Why don't people learn more of them? It's not for lack of access, so subsidizing access will be less efficient than subsidizing the fixing of the actual problem, which is... what? I don't know. Boredom maybe? If it's boredom, better to subsidize the YouTubers, podcasters, and TikTokers than the colleges (if you're worried about the state of humanities with regard to their own metrics of success--say, rhetoric--then who better to be the spokespeople?).

In Magna Alta Doctrina, Jacob Cannell talks about exponentiated gradient descent as a way of approximating Solomonoff induction using ANNs:

While that approach is potentially interesting by itself, it's probably better to stay within the real algebra. The Solomonoff style partial continuous update for real-valued weights would then correspond to a multiplicative weight update rather than an additive weight update as in standard SGD.

Has this been tried/evaluated? Why actually yes - it's called exponentiated gradient descent, as exponentiating the result of additive updates is equivalent to multiplicative updates. And intriguingly, for certain 'sparse' input distributions the convergence or total error of EGD/MGD is logarithmic rather than the typical inverse polynomial of AGD (additive gradient descent): O(log N) vs O(1/N) or O(1/N^2), and fits naturally with 'throw away half the theories per observation'.

The situations where EGD outperforms AGD, or vice versa, depend on the input distribution: if it's more normal then AGD wins, if it's more sparse log-normal then EGD wins. The moral of the story is there isn't one single simple update rule that always maximizes convergence/performance; it all depends on the data distribution (a key insight from Bayesian analysis).

The exponential/multiplicative update is correct in Solomonoff's use case because the different sub-models are strictly competing rather than cooperating: we assume a single correct theory can explain the data, and predict through an ensemble of sub-models. But we should expect that learned cooperation is also important - and more specifically if you look more deeply down the layers of a deeply factored net at where nodes representing sub-computations are more heavily shared, it perhaps argues for cooperative components.

My read of this is that we get a criterion for when one should be a hedgehog versus a fox in forecasting: one should be a fox when the distributions you need to operate in are normal, or rather when they do not have long tails, and a hedgehog when your input distribution is more log-normal, or rather when there may be long tails.

This makes some sense. If you don't have many outliers, most theories should agree with each other, it's hard to test & distinguish between the theories, and if one of your theories *does* make striking predictions far different from your other theories, it's probably wrong, just because striking things don't really happen.

In contrast, if you need to regularly deal with extreme scenarios, you need theories capable of generalizing to those extreme scenarios, which means not throwing out theories for making striking or weird predictions. Striking events end up being common, so it's less an indictment.

But there are also reasons to think this is wrong. Hits-based entrepreneurship approaches, for example, seem to be more foxy than standard quantitative or investment finance, and hits-based entrepreneurship works precisely because the distribution of outcomes for companies is long-tailed.

In some sense the difference between the two approaches is a "sin of omission" vs a "sin of commission" disagreement, where the hits-based approach needs to see how something could go right, while the standard finance approach needs to see how something could go wrong. So it's not so much a predictive disagreement between the two approaches, but more a decision theory or comparative advantage difference.
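For readers who want the multiplicative versus additive distinction from the quoted passage in concrete form, here is a minimal Python sketch. It is an illustration under assumptions, not Cannell's code: the three "theories", their prior weights, and their likelihoods are made-up numbers, and the per-theory loss is assumed to be negative log likelihood with learning rate 1. Under those assumptions the exponentiated update reproduces a Bayesian posterior over competing theories, while the additive update on the same signal does not even stay a probability distribution.

```python
import math


def additive_step(weights, grads, lr):
    """Additive gradient descent: w_i <- w_i - lr * g_i."""
    return [w - lr * g for w, g in zip(weights, grads)]


def exponentiated_step(weights, grads, lr):
    """Exponentiated gradient: w_i <- w_i * exp(-lr * g_i), then renormalize.

    This is an additive step on log(w_i) followed by normalization.
    """
    unnorm = [w * math.exp(-lr * g) for w, g in zip(weights, grads)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def bayes_update(prior, likelihoods):
    """Posterior over theories: prior times likelihood, renormalized."""
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical numbers: three competing theories, their prior weights,
# and the probability each assigned to the datum that was observed.
prior = [0.5, 0.3, 0.2]
likelihoods = [0.9, 0.5, 0.1]

# Per-theory loss is negative log likelihood. It is also the gradient of
# the weighted total loss sum_i w_i * loss_i with respect to w_i.
losses = [-math.log(l) for l in likelihoods]

eg_posterior = exponentiated_step(prior, losses, lr=1.0)
bayes_posterior = bayes_update(prior, likelihoods)
additive_result = additive_step(prior, losses, lr=1.0)
```

The multiplicative rule punishes a theory in proportion to its current weight, which is what makes it suit strictly competing theories; the additive rule subtracts the same amount regardless of weight and here pushes some weights below zero.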

I promise I won't just continue to re-post a bunch of papers, but this one seems relevant to many around these parts. In particular @Elizabeth (also, sorry if you dislike being at-ed like that).

Associations of dietary patterns with brain health from behavioral, neuroimaging, biochemical and genetic analyses

h/t Hal Herzog via Tyler Cowen