Marius Hobbhahn

I'm currently doing a Ph.D. in ML at the International Max-Planck research school in Tübingen. I also do independent research. I'm currently excited about an agenda that is roughly "do the kind of mechanistic interpretability that will get us closer to detecting deceptive behavior in future models". If you think I should work for you, please reach out.


I subscribe to Crocker's Rules



How confident are you, based on these papers, that the model is literally doing gradient descent? My understanding was that the evidence in these papers is not very conclusive, and I treated it more as an initial hypothesis than an actual finding.

Even if you have the redundancy at every layer, you are still running copies of the same layer, right? Intuitively, I would say this is unlikely to be more space-efficient than not copying a layer and doing something else, but I'm very uncertain about this argument.

I intend to look into the Knapsack + DP algorithm problem at some point. If I were to find that the model implements the DP algorithm, it would change my view on mesa optimization quite a bit. 
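For concreteness, this is the kind of DP solution I'd be looking for in the model's computation (a standard 0/1 knapsack sketch; the function name and interface here are mine, not from any paper):

```python
def knapsack(values, weights, capacity):
    """Classic 0/1 knapsack via dynamic programming.

    dp[c] holds the best total value achievable with total weight <= c.
    """
    dp = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        # Iterate capacities downward so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            dp[c] = max(dp[c], dp[c - w] + v)
    return dp[capacity]
```

Finding circuits that implement this kind of table-filling loop, rather than a greedy heuristic, is what would update me on mesa optimization.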

No plans so far. I'm a little unhappy with the experimental design from last time. If I ever come back to this, I'll change the experiments up anyways.

Could you elaborate a bit more about the strategic assumptions of the agenda? For example,
1. Do you think your system is competitive with end-to-end Deep Learning approaches?
1.1. Assuming the answer is yes, do you expect CoEm to be preferable to users?
1.2. Assuming the answer is no, how do you expect it to get traction? Is the path through lawmakers understanding the alignment problem and banning everything that is end-to-end and doesn't have the benefits of CoEm?
2. Do you think this is clearly the best possible path for everyone to take right now or more like "someone should do this, we are the best-placed organization to do this"? 

PS: Kudos for publishing the agenda and opening yourself up to external feedback.

Fair. You convinced me that the effect is more determined by layer norm than by cross-entropy.

I agree that the layer norm does some work here, but I think some parts of the explanation can be attributed to the inductive bias of the cross-entropy loss. I have been playing around with small toy transformers without layer norm, and they show roughly similar behavior to that described in this post (I ran different experiments, so I'm not confident in this claim).

My intuition was roughly:
- the softmax doesn't care about absolute size, only about the relative differences of the logits.
- thus, the network merely has to make the correct logits really big and the incorrect logits small
- to get the logits, you take the inner product of the activations and the unembedding. The more aligned the direction of the correct class's activations with the corresponding unembedding weights (i.e. the larger their cosine similarity), the bigger the logits.
- Thus, direction matters more than distance. 
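The two properties above can be checked directly in a toy numpy sketch (the vectors here are made up for illustration; `w` stands in for a hypothetical unembedding row):

```python
import numpy as np

def softmax(x):
    # Subtracting the max changes nothing: softmax only sees
    # relative differences between logits, not their absolute size.
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
assert np.allclose(softmax(logits), softmax(logits + 100.0))

# logit = <activation, w> = ||a|| * ||w|| * cos(angle), so a small
# activation aligned with w beats a much larger misaligned one.
w = np.array([1.0, 0.0, 0.0])            # unit-norm unembedding direction
aligned = 2.0 * w                        # norm 2, cosine similarity 1
misaligned = np.array([0.0, 10.0, 0.0])  # norm 10, cosine similarity 0
print(aligned @ w, misaligned @ w)       # 2.0 vs 0.0
```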

Layernorm seems to even further reduce the effect of distance but I think the core inductive bias comes from the cross-entropy loss. 

I don't think there is a general answer here. But here are a couple of considerations:
- networks can get stuck in local optima, so if you initialize it to memorize, it might never find a general solution.
- grokking has shown that with high weight regularization, networks can transition from memorized to general solutions, so it is possible to move from one to the other.
- it probably depends a bit on how exactly you initialize the memorized solution. You can represent lookup tables in different ways and some are much more liked by NNs than others. For example, I found that networks really don't like it if you set the weights to one-hot vectors such that one input only maps to one feature.
- My prediction for empirical experiments here would be something like "it might work in some cases but not be clearly better in the general case. It will also depend on a lot of annoying factors like weight decay and learning rate and the exact way you build the dictionary". 
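To illustrate the one-hot lookup-table representation I mean (a toy numpy sketch; whether training escapes such an initialization depends on the factors listed above):

```python
import numpy as np

n_inputs, n_features = 4, 4

# Memorized "lookup table": weight row i is a one-hot vector, so
# input i activates exactly one feature and nothing else. In my
# experience, networks initialized this way train poorly.
W = np.eye(n_inputs, n_features)

x = np.zeros(n_inputs)
x[2] = 1.0           # one-hot input for item 2
features = x @ W     # -> one-hot feature vector
print(features)      # [0. 0. 1. 0.]
```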

I agree with everything you're saying. I just want to note that as soon as someone starts training networks in a way where not all weights are updated simultaneously, e.g. because the weights are updated only for specific parts of the network, or when the network has an external memory that is not changed every training step, gradient hacking seems immediately much more likely and much scarier. 

And there are probably hundreds of researchers out there working on modular networks with memory, so it probably won't take long before we have models that plausibly have the capabilities to do gradient hacking. Whether they actually do it is a totally different question, but it would be much easier to create a story of how such networks would gradient hack.

This criticism has been made for the last 40 years and people have usually had new ideas and were able to execute them. Thus, on priors, we think this trend will continue even if we don't know exactly which kind of ideas they will be. 

In fact, due to our post, we were made aware of a couple of interesting ideas about chip improvements that we hadn't considered before that might change the outcome of our predictions (towards later limits) but we haven't included them in the model yet. 

Hmmm interesting. 

Can you provide some of your reasons or intuitions for this fast FOOM?

My intuition against it is mostly like "intelligence just seems to be compute-bound, and thus extremely fast takeoffs (hours to weeks) are unlikely". But I feel very uncertain about this take and would like to refine it. So just understanding your intuitions better would probably already help a lot.

I think it's mostly my skepticism about extremely fast economic transformations. 

Like GPT-3 could probably automate more parts of the economy today but somehow it just takes a while for people to understand that and get it to work in practice. I also expect that it will take a couple of years between showing the capabilities of new AI systems in the lab and widespread economic impact just because humans take a while to adapt (at least with narrow systems). 

At some point (maybe in 2030) we will reach a level where AI is as capable as humans in many tasks and then the question is obviously how fast it can self-improve.  I'm skeptical that it is possible to self-improve as fast as the classic singularity story would suggest. In my mind, you mostly need more compute for training, new training data, new task design, etc. I think it will take some time for the AI to come up with all of that and even then, exponential demands just have their limits. Maybe the AI can get 100x compute and train a new model but getting 10000x compute probably won't happen immediately (at least in my mind; arguments for or against are welcome). 

Lastly, I wrote a story about my median scenario. I do have lots of uncertainty about what the TAI distribution should look like (see here), but my mode is at 2032-2035 (i.e. earlier than my median). So I could have also written a story with faster developments, and it would reflect a slightly different corner of my probability distribution. But due to the reasons above, it would mostly look like a slightly faster version of this story.

And your scenario is within the space of scenarios that I think could happen; I just think it's less likely than a less accelerationist, slower transition. But I'm obviously not very confident in this prediction.
