Edoardo Pona

I am interested in Alignment, Mechanistic Interpretability, Agents, and the theory of how neural networks work. 




What exactly do you mean by 'in parallel' and 'in series'?

RLHF is already a counterexample. It's had an impact because of alignment's inherent dual use - is that cheating? No: RLHF is economically valuable, but Paul still put work into it for alignment before profit-seeking researchers got to it. It could have been otherwise.


I think this works as a counterexample in the case of Paul, because we assume that he is not part of the 'profit-seeking' group (regardless of whether it's true, and depending on what that means). However, as you pointed out earlier, alignment itself is crucial for the 'profit-seeking' case.

Let's pretend for a minute that alignment and capabilities are two separate things. As you point out, a capable but misaligned (wrt. its creators' goals) system is not profit-generating. This means alignment as a whole is 'profitable', regardless of whether an individual researcher is motivated by profit or not. The converse can be said about capabilities: it is not necessarily the case that a capabilities researcher is motivated by profit (in which case maybe they should focus on alignment at some point instead).


I haven't yet explored the case you describe with transformers. In general I suspect the effect will be more complex with multilayer transformers than it is for simple autoencoders. However, in the simple case explored so far, it seems this conclusion can be drawn. Dropout essentially adds variability to the direction in which each feature is represented, so features can't be 'packed' as tightly.

Further evidence for what you show in this post is the plot of the kurtosis of the activation vectors with varying dropout. Using kurtosis to measure the presence of a privileged basis is introduced in Privileged Bases in the Transformer Residual Stream. Activations in a non-privileged basis should come from a distribution with kurtosis ~3 (isotropic Gaussian).

With higher dropout, you also get higher kurtosis, which implies a heavier-tailed distribution of activations.
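As a rough illustration of the kurtosis check, here is a minimal sketch on synthetic data rather than real model activations. The sizes, the `mean_kurtosis` helper, and the way a 'privileged' basis is faked (scaling a few coordinates) are all assumptions made for the example, not the setup used in the post.

```python
import numpy as np


def mean_kurtosis(activations: np.ndarray) -> float:
    """Average non-excess kurtosis of each activation vector, taken across its dimensions."""
    centered = activations - activations.mean(axis=1, keepdims=True)
    second_moment = (centered ** 2).mean(axis=1)
    fourth_moment = (centered ** 4).mean(axis=1)
    return float((fourth_moment / second_moment ** 2).mean())


rng = np.random.default_rng(0)
num_vectors, d_model = 2000, 512  # hypothetical sizes, chosen only for illustration

# Isotropic Gaussian activations: no coordinate is special, kurtosis should be close to 3.
isotropic = rng.standard_normal((num_vectors, d_model))

# Toy 'privileged basis': a handful of coordinates carry much larger activations.
privileged = isotropic.copy()
privileged[:, :4] *= 10.0

gaussian_kurtosis = mean_kurtosis(isotropic)
privileged_kurtosis = mean_kurtosis(privileged)
print(f"isotropic: {gaussian_kurtosis:.2f}, privileged: {privileged_kurtosis:.2f}")
```

The same helper could be applied to activations collected at different dropout rates to see whether the kurtosis moves away from 3.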

Cool post! I have done some similar work, and my hypothesis for why dropout may inhibit superposition is that spreading out feature representations across multiple neurons exposes them to a higher chance of being perturbed. If we have a feature represented across $n$ neurons, with dropout $p$, the chance that the feature does not get perturbed at all is $(1-p)^n$.
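To make the argument concrete, here is a small sketch that assumes each neuron is dropped independently with probability `p_drop`, so a feature spread over `n_neurons` neurons survives untouched with probability `(1 - p_drop) ** n_neurons`. The function names and parameter values are hypothetical, and the Monte Carlo check is only there as a sanity test of that formula.

```python
import numpy as np


def prob_unperturbed(n_neurons: int, p_drop: float) -> float:
    """Chance that none of the neurons carrying a feature is dropped (independent dropout assumed)."""
    return (1.0 - p_drop) ** n_neurons


def simulate_unperturbed(n_neurons: int, p_drop: float, trials: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability."""
    rng = np.random.default_rng(seed)
    # True where a neuron is dropped in a given trial.
    dropped = rng.random((trials, n_neurons)) < p_drop
    # A feature is unperturbed only if no neuron in its row was dropped.
    return float((~dropped.any(axis=1)).mean())


for n in (1, 2, 4, 8):
    exact = prob_unperturbed(n, p_drop=0.2)
    estimate = simulate_unperturbed(n, p_drop=0.2)
    print(f"n={n}: exact={exact:.4f}, simulated={estimate:.4f}")
```

The survival probability falls off geometrically in the number of neurons, which is the sense in which a more distributed representation is more exposed to dropout noise.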

One thing I've always found inaccurate/confusing about the cave analogy is that humans interact with what they perceive (whether it's reality or shadows), as opposed to the slave, who can only passively observe the projections on the wall. Even if we can't perceive reality directly (whatever this means), by interacting with it we can poke at it to gain a much better understanding of things (e.g. experiments). This extends to things we can't directly see or measure.

ChatGPT is similar to the slave in the cave itself, having the ability to observe the real world, or some representation of it, but not (yet) interacting with it to learn (if we only consider it to learn at training time).  

Another way to understand this distinction is to think about the limit of infinite compute, where direct and amortized optimizers converge to different solutions. A direct optimizer, in the infinite compute limit, will simply find the optimal solution to the problem. An amortized optimizer would find the Bayes-optimal posterior over the solution space given its input data.

Doesn't this depend on how, practically, the compute is used? Would this still hold in a setting where the compute can be used to generate the dataset?

Taking the example of training an RL agent, an increase in compute will mostly (and more effectively) be used towards getting the agent more experience rather than using a larger network or model. 'Infinite' compute would be used to generate a correspondingly 'infinite' dataset, rather than optimising the model 'infinitely' well on some fixed experience.

In this case, while it remains true that the amortised optimiser will act closely following the data distribution it was trained on, this will include all outcomes, just like in the case of the direct optimiser, unless the data generation policy (i.e. exploration) precludes visiting some of the states. So as long as the unwanted solutions that would be generated by a direct optimiser find their way into the dataset, the amortised agent will learn to produce them.

In the limit of compute, to have an (amortised) agent with predictable behaviour you would have to analyse / prune this dataset.  Wouldn't that be as costly and hard as pruning the search of a direct optimiser of undesired outcomes?