Replying toWhat's the Right Way to think about Information Theoretic quantities in Neural Networks?

Guillaume CorlouerJan 21, 2025

What's the Right Way to think about Information Theoretic quantities in Neural Networks?

Another perspective would be too look at the activations of an autoregressive deep learning model, e.g. a transformer, during inference as a stochastic process: the collection of activation $(X_{t})$ at some layer as random variables indexed by time t, where t is token position.

One could for example look at mutual information between the history $X_{t}^{-} = (X_{t}, X_{t - 1}, . . .)$ and the future of the activations $X_{t + 1}$ , or look at (conditional) mutual information between the past and future of subprocesses of $X_{t}$ (note: transfer entropy can be a useful tool to quantify directed information flow between different stochastic processes). There are many information-theoretic quantities one could be looking at.

If you want to formally define a probability distribution over activations, you could maybe push forward... (read more)

Replying toIs AI Physical?

Guillaume Corlouer1y

Is AI Physical?

An interesting analogy, closer to ML, would be to look at neuroscience. It's an older field than ML, and it seems that the physics perspective has been fairly productive, even though not successful at providing a grand unified theory of cognition yet. Some examples:

Using methods from electric circuits to explain neurons (Hodgkin-Huxley model, cable theory)
Dynamical systems to explain phenomena like synchronization in neuronal oscillations (ex: Kuramoto model)
Ising models to model some collective behaviour of neurons
Information theory is commonly used in neuroscience to analyze neural data and model the brain (ex: efficient coding hypothesis)
Attempts at general theories of cognition like predictive processing, or the free energy principle which also have a strong physics

... (read more)

Replying toThe purposeful drunkard

Guillaume Corlouer1y

The purposeful drunkard

Interesting! Perhaps one way to not be fooled by such situations could be to use a non-parametric statistical test. For example, we could apply permutation testing: by resampling the data to break its correlation structure and performing PCA on each permuted dataset, we can form a null distribution of eigenvalues. Then, by comparing the eigenvalues from the original data to this null distribution, we could assess whether the observed structure is unlikely under randomness. Specifically, we’d look at how extreme each original eigenvalue is relative to those from the permuted data to decide if we can reject the null hypothesis that the data is random. Could be computationally prohibitive though, but maybe there are some ways to go around that (reducing the number of permutations, downsampling the data....). Also note that failing to reject the null does not mean that the data is random. Somehow it does not look like these kind of tests are common practice in interpretability?

An information-theoretic study of lying in LLMs

Annah

Annah, Guillaume Corlouer

We recently published a workshop paper at ICML 2024 which we want to summarize here briefly. Code for the paper can be found on GitHub. This work was done as part of the PIBBSS affiliateship. Thx to @Jayjay for comments on the draft.

TLDR: We analyze the probability distribution over token space in hidden LLM layers for a model that is instructed to lie and to tell the truth. We extract this probability distribution by applying the logit/tuned lens to the residual stream. We find large differences between lying and truth-telling in the entropy and KL-divergence of the predictive distribution and in the probability of the predicted token. The results suggest that the... (read 980 more words →)

Degeneracies are sticky for SGD

Guillaume Corlouer

Guillaume Corlouer, Nicolas Macé

Introduction

Singular learning theory (SLT) is a theory of learning dynamics in Bayesian statistical models. It has been argued that SLT could provide insights into the training dynamics of deep neural networks. However, a theory of deep learning inspired by SLT is still lacking. In particular it seems important to have a better understanding of the relevance of SLT insights to stochastic gradient descent (SGD) – the paradigmatic deep learning optimization algorithm.

We explore how the degeneracies^[1] of toy, low dimensional loss landscapes affect the dynamics of stochastic gradient descent (SGD).^[2] We also investigate the hypothesis that the set of parameters selected by SGD after a large number of gradient steps on a degenerate landscape is distributed... (read 4653 more words →)

Replying toTowards Measures of Optimisation

Guillaume Corlouer3y

Towards Measures of Optimisation

Right, I got confused because I thought your problem was about trying to define a measure of optimisation power - for ex analogous to the Yudkowsky measure - that was also referring to a utility function, while being invariant from scaling and translation but this is different from asking

"what fraction of the default expected utility comes from outcomes at least as good as this one?’"

Replying toTowards Measures of Optimisation

Guillaume Corlouer3y

Towards Measures of Optimisation

What about optimisation power of as a measure of outcome that have utility greater than the utility of $x^{'}$ ?

Let $U_{x^{'}}$ be the set of outcome with utility greater than $x^{'}$ according to utility function $u$ :

U_{x^{'}} := {x \in X | u (x) \geq u (x^{'})}

The set $U_{x^{'}}$ is invariant under translation and non-zero rescaling of the utility function $u$ and we define the optimisation power of the outcome $x$ ' according to utility function $u$ as:

O P_{u} (x^{'}) := - log (\int_{U_{x^{'}}} p) = - log (\int_{x \in U_{x^{'}}} p (x) d x)

This does not suffer from comparing w.r.t a worst case and seem to satisfies the same intuition as the original OP definition while referring to some utility function.

This is in fact the same measure as the original optimisation power measure with an order given by the utility function

Understanding mesa-optimization using toy models

tilmanr

tilmanr, rusheb, Guillaume Corlouer, Dan Valentine, afspies, mivanitskiy, Can

Overview

Solving the problem of mesa-optimization would probably be easier if we understood how models do search internally
We are training GPT-type models on the toy task of solving mazes and studying them in both a mechanistic interpretability and behavioral context.
This post lays out our model training setup, hypotheses we have, and the experiments we are performing and plan to perform. Experimental results will be forthcoming in our next post.
We invite members of the LW community to challenge our hypotheses and the potential relevance of this line of work. We will follow up soon with some early results^[1]. Our main source code is open source, and we are open to collaborations.

Introduction

Some threat models of... (read 2943 more words →)

Replying to"Brain enthusiasts" in AI Safety

Guillaume Corlouer4y

"Brain enthusiasts" in AI Safety

Nice recommendations! In addition to brain enthusiasts being useful for empirical work, there also are theoretical tools from system neuroscience that could be useful for AI safety. One area in particular would be for interpretability: if we want to model a network at various levels of "emergence", recent development in information decomposition and multivariate information theory to move beyond pairwise interaction in a neural network might be very useful. Also see recent publications to model synergestic information and dynamical independance to perhaps automate macro variables discovery which could also be well worth exploring to study higher levels of large ML models. This would actually require both empirical and theoretical work as once the various measures of information decomposition are clearer one would need to empirically estimate test them and use them in actual ML systems for interpretability if they turn out to be meaningful.

Replying toMetalignment: Deconfusing metaethics for AI alignment.

Guillaume Corlouer6y

Metalignment: Deconfusing metaethics for AI alignment.

Thanks for all the useful links! I'm also always happy to receive more feedback.

I agree that the sense in which I use metaethics in this post is different from what academic philosophers usually call metaethics. I have the impression that metaethics, in academic sense, and metaphilosophy are somehow related. Studying what morality itself is, how to select ethical theories and what is the process behind ethical reasoning seems not independent. For example if moral nihilism is more plausible then it seems to be less likely that there is some meaningful feedback loop to select ethical theories or that there is such a meaningful thing as a ‘good’ ethical theory (at least... (read more)

Replying toMetalignment: Deconfusing metaethics for AI alignment.

Guillaume Corlouer6y

Metalignment: Deconfusing metaethics for AI alignment.

Sure, I'm happy to read/discuss your ideas about this topic.

Replying toMetalignment: Deconfusing metaethics for AI alignment.

Guillaume Corlouer6y

Metalignment: Deconfusing metaethics for AI alignment.

I am not sure about what computer aided analysis mean but one possibility could be to have formal ethical theories and prove some theorem inside their formal framework. But this raises questions about the sort of formal framework that one could use to 'prove theorems' under ethics in a meaningful way.

Metalignment: Deconfusing metaethics for AI alignment.

Guillaume Corlouer

Epistemic status: MSFP blog post day. General and very speculative ideas.

Proposition : Deconfusing metaethics might be a promising way to increase our chances of solving AI alignment.

What do I mean by metaethics?

Metaethics here is understood as an ideal procedure that humans are approximating when they reason about ethics i.e. when they are trying to build ethical theories. Let's have a look at mathematics for an analogy. Part of the mathematical production involves using some theory of logic to prove or disprove some conjecture about some mathematical object. Theorems, lemmas and properties that one can derive from axioms working with some logic is, roughly, part of how mathematics progresses. Another analogy is... (read 859 more words →)

LESSWRONG
LW

LESSWRONG
LW

Guillaume Corlouer

Degeneracies are sticky for SGD

Understanding mesa-optimization using toy models

An information-theoretic study of lying in LLMs

Metalignment: Deconfusing metaethics for AI alignment.

Guillaume Corlouer

Guillaume Corlouer

An information-theoretic study of lying in LLMs

Degeneracies are sticky for SGD

Understanding mesa-optimization using toy models

Metalignment: Deconfusing metaethics for AI alignment.

Guillaume Corlouer

Degeneracies are sticky for SGD

Understanding mesa-optimization using toy models

An information-theoretic study of lying in LLMs

Metalignment: Deconfusing metaethics for AI alignment.

Guillaume Corlouer

Guillaume Corlouer

An information-theoretic study of lying in LLMs

Degeneracies are sticky for SGD

Understanding mesa-optimization using toy models

Metalignment: Deconfusing metaethics for AI alignment.

Introduction

Overview

Introduction

What do I mean by metaethics?