OR PhD student at MIT

Wiki Contributions


For mechanistic interpretability research, we just released a new paper on neuron interpretability in LLMs, with a large discussion on superposition! See
Paper: https://arxiv.org/abs/2305.01610
Summary: https://twitter.com/wesg52/status/1653750337373880322

There has been some work on understanding in-context learning which suggests that models are doing literal gradient descent:

Superposition allows the model to do a lot of things at once. Thus, if the model wants to use its space efficiently, it performs multiple steps at once or uses highly compressed heuristics even if they don’t cover all corner cases. Especially in feed-forward models, the model can’t repeatedly apply the same module. Thus, implementing a “clean” algorithm would require implementing a similar algorithm in multiple layers which seems very space inefficient. 

I think the first and last sentence are inconsistent. Since superposition lets you do so much at once, you can get away with having this redundancy in every layer, especially if this capacity is dedicated to a general search process that would reduce loss across a wide variety of tasks.

When models are trained to solve small optimization tasks like Knapsack, they will not rediscover either the recursive or the dynamic programming solution

I think it depends on the problem size. If the number of layers is greater than the max weight, then I would expect the model to implement the DP algorithm (and if less, I wouldn't be surprised if it still implemented an approximate DP algorithm). 

In general though, I agree that the bounded depth is what makes it hard for a transformer to implement general optimization procedures.

Many parts of west Texas are also suitable for wind power which could potentially be interspersed within a large solar array. Increasing the power density of the land might make it cost effective to develop high energy industries in the area or justify the cost of additional infrastructure.

You could hope for more even for a random non-convex optimization problem if you can set up a tight relaxation. E.g. this paper gives you optimality bounds via a semidefinite relaxation, though I am not sure if it would scale to the size of problems relevant here.

Would love to see more in this line of work.

We then can optimize the rotation matrix and its inverse so that local changes in the rotated activation matrix have local effects on the outputted activations.

Could you explain how you are formulating/solving this optimization problem in more detail?

Could you describe your inner thought loop when conducting these sorts of mechanistic analyses? I.e., What Are You Tracking In Your Head? 

Indeed, it does seem possible to figure out where simple factual information is stored in the weights of a LLM, and to distinguish between knowing whether it "knows" a fact versus it simply parroting a fact.

In addition to Google scholar, connected papers is a useful tool to quickly sort through related work and get a visual representation of a subarea.