All of Marius Hobbhahn's Comments + Replies

How confident are you that the model is literally doing gradient descent from these papers? My understanding was that the evidence in these papers is not very conclusive and I treated it more as an initial hypothesis than an actual finding. 

Even if you have the redundancy at every layer, you are still running copies of the same layer, right? Intuitively I would say this is not likely to be more space-efficient than not copying a layer and doing something else but I'm very uncertain about this argument. 

I intend to look into the Knapsack + DP algorithm problem at some point. If I were to find that the model implements the DP algorithm, it would change my view on mesa optimization quite a bit. 

I think that these papers do provide sufficient behavioral evidence that transformers are implementing something close to gradient descent in their weights. Garg et al. 2022 examine the performance of 12-layer GPT-style transformers trained to do few-shot learning and show that they can in-context learn 2-layer MLPs. The performance of their model closely matches an MLP trained with GD for 5000 steps on those same few-shot examples, and it cannot be explained by heuristics like averaging the K-nearest neighbors from the few-shot examples. Since the inputs are fairly high-dimensional, I don't think they can be performing this well by only memorizing the weights they've seen during training. The model is also fairly robust to distribution shifts in the inputs at test time, so the heuristic they must be learning should be pretty similar to a general-purpose learning algorithm.

I think that there is also some amount of mechanistic evidence that transformers implement some sort of iterative optimization algorithm over some quantity stored in the residual stream. In one of the papers mentioned above (Akyurek et al. 2022), the authors trained a probe to extract the ground-truth weights of the linear model from the residual stream and it appears to somewhat work. The diagrams seem to show that it gets better when trained on activations from later layers, so it seems likely that the transformer is iteratively refining its prediction of the weights.

No plans so far. I'm a little unhappy with the experimental design from last time. If I ever come back to this, I'll change the experiments up anyways.

Could you elaborate a bit more about the strategic assumptions of the agenda? For example,
1. Do you think your system is competitive with end-to-end Deep Learning approaches?
1.1. Assuming the answer is yes, do you expect CoEm to be preferable to users?
1.2. Assuming the answer is no, how do you expect it to get traction? Is the path through lawmakers understanding the alignment problem and banning everything that is end-to-end and doesn't have the benefits of CoEm? 
2. Do you think this is clearly the best possible path for everyone to take right now o... (read more)

Fair. You convinced me that the effect is more determined by layer-norm than cross-entropy.

I agree that the layer norm does some work here but I think some parts of the explanation can be attributed to the inductive bias of the cross-entropy loss. I have been playing around with small toy transformers without layer norm and they show roughly similar behavior as described in this post (I ran different experiments, so I'm not confident in this claim). 

My intuition was roughly:
- the softmax doesn't care about absolute size, only about the relative differences of the logits.
- thus, the network merely has to make the correct logits really big an... (read more)
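A minimal sketch of the softmax point above, in plain Python. The function names and example logits are illustrative assumptions, not taken from the toy-transformer experiments: adding a constant to every logit leaves the softmax output unchanged, while scaling up logits whose largest entry is already the correct class lowers the cross-entropy loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the target index."""
    return -math.log(softmax(logits)[target])

# Hypothetical logits; index 0 is assumed to be the correct class.
logits = [2.0, 0.5, -1.0]
shifted = [z + 100.0 for z in logits]  # same differences, larger absolute size
scaled = [z * 10.0 for z in logits]    # larger differences between logits

probs = softmax(logits)
probs_shifted = softmax(shifted)       # matches probs: only differences matter
loss = cross_entropy(logits, 0)
loss_scaled = cross_entropy(scaled, 0) # smaller than loss
```

Under these assumptions, the loss keeps shrinking as the gap between the correct logit and the rest grows, which is one way to read the inductive-bias argument.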

I agree that there are many reasons that directions do matter, but clearly distance would matter too in the softmax case! Also, without layernorm, intermediate components of the network could "care more" about the magnitude of the residual stream (whereas it only matters for the unembed here), while for networks w/ layernorm the intermediate components literally do not have access to magnitude data!
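To make the layernorm point concrete, here is a small sketch in plain Python. It assumes a simplified layer norm with no learned gain or bias and with the stabilizing epsilon set to zero (real implementations use a small positive epsilon, so the invariance is only approximate there); the vector values are made up.

```python
import math

def layer_norm(x, eps=0.0):
    """Simplified layer norm: subtract the mean, divide by the std.

    Assumption: no learned gain/bias, and eps defaults to 0 so that the
    scale invariance is exact up to floating-point error.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))

# Hypothetical residual-stream vector and a 10x larger copy of it.
residual = [1.0, -2.0, 0.5, 3.0]
residual_big = [10.0 * v for v in residual]

normed = layer_norm(residual)
normed_big = layer_norm(residual_big)  # same as normed: magnitude is gone
```

Anything that reads from the normalized vector sees the same input for both residual streams, even though their norms differ by a factor of 10.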

I don't think there is a general answer here. But here are a couple of considerations:
- networks can get stuck in local optima, so if you initialize it to memorize, it might never find a general solution.
- grokking has shown that with high weight regularization, networks can transition from memorized to general solutions, so it is possible to move from one to the other.
- it probably depends a bit on how exactly you initialize the memorized solution. You can represent lookup tables in different ways and some are much more liked by NNs than others. For examp... (read more)

I agree with everything you're saying. I just want to note that as soon as someone starts training networks in a way where not all weights are updated simultaneously, e.g. because the weights are updated only for specific parts of the network, or when the network has an external memory that is not changed every training step, gradient hacking seems immediately much more likely and much scarier. 

And there are probably hundreds of researchers out there working on modular networks with memory, so it probably won't take that long until we have models that... (read more)

This criticism has been made for the last 40 years and people have usually had new ideas and were able to execute them. Thus, on priors, we think this trend will continue even if we don't know exactly which kind of ideas they will be. 

In fact, due to our post, we were made aware of a couple of interesting ideas about chip improvements that we hadn't considered before that might change the outcome of our predictions (towards later limits) but we haven't included them in the model yet. 

Hmmm interesting. 

Can you provide some of your reasons or intuitions for this fast FOOM?

My intuition against it is mostly like "intelligence just seems to be compute bound and thus extremely fast takeoffs (hours to weeks) are unlikely". But I feel very uncertain about this take and would like to refine it. So just understanding your intuitions better would probably already help a lot. 

3 Daniel Kokotajlo 3mo
Every time I sit down to make a model of takeoff, or read someone else's model & input my own values for the parameters, it ends up being pretty fast. Much faster than your story. (In fact, I'm not sure I've ever seen a model that results in a takeoff as slow as your story, even with other people's values to the parameters.) That's the main reason why I have faster-takeoff views. There's a big gap between "hours to weeks" and "10+ years!" I don't think intelligence is compute bounded in the relevant sense, but even if it was (see my "Main alternative" in response to Richard elsewhere in this thread) it would maybe get us to a 3 year post-AGI takeoff at most, I'd say. If you have time to elaborate more on your model -- e.g. what you mean by intelligence being compute bounded, and how that translates into numbers for post-AGI takeoff speed -- I'd be interested to hear it!  

I think it's mostly my skepticism about extremely fast economic transformations. 

Like GPT-3 could probably automate more parts of the economy today but somehow it just takes a while for people to understand that and get it to work in practice. I also expect that it will take a couple of years between showing the capabilities of new AI systems in the lab and widespread economic impact just because humans take a while to adapt (at least with narrow systems). 

At some point (maybe in 2030) we will reach a level where AI is as capable as humans in man... (read more)

5 Daniel Kokotajlo 3mo
For pre-AGI systems, I agree that it's going to take a substantial amount of time for them to be deployed as products and transform the economy. Which is why my What 2026 Looks Like story doesn't have GDP even begin to accelerate by 2026. For post-AGI systems, I forecast a fairly quick FOOM period (less than a year probably? Weeks or days, possibly) during which we keep basically the same hardware, but the "software" improves dramatically. We end up with something like von Neumann's brain, but qualitatively better in every way, and also cheaper and faster, and with a huge population size (millions? Billions?). They do more quality-adjusted intellectual labor in a few hours than all of human science could do in decades. They figure out a plan for dramatically better hardware, dramatically better robots, etc. and then from then on the human economy works feverishly to follow The Plan as fast as possible, instructed by superintelligent AI overseers. I think maybe it'll take only a few years at most to get to crazy nanobot tech (or some other kind of fully automated industry). Could take only a few days potentially.  

Well maybe. I still think it's easier to build AGI than to understand the brain, so even the smartest narrow AIs might not be able to build a consistent theory before someone else builds AGI. 

I'm not very bullish on HMI. I think the progress humanity makes in understanding the brain is extremely slow and because it's so hard to do research on the brain, I don't expect us to get much faster. 

Basically, I expect humanity to build AGI way before we are even close to understanding the brain. 

Well, narrow superintelligent AIs might help us understand the brain before then.

I know your reasoning and I think it's a plausible possibility. I'd be interested in what the disruption of society by AI looks like in your scenario. 

Is it more like one or a few companies have AGIs but the rest of the world is still kinda normal or is it roughly like my story just 2x as fast?

Thanks for pointing this out. I made a clarification in the text. 

Similar to Daniel, I'd also be interested in what public opinion was at the time or what the consensus view among experts was if there was one. 

Also, it seems like the timeframe for mobile phones is 1993 to 2020 if you can trust this statistic.

The timeframe for mobile phones is actually 1993 to 2013; mobile phone ownership went from 20% to 50% from 2010 to 2013.

Definitely could be but don't have to be. We looked a bit into cooling and heat and did not find any clear consensus on the issue. 

We did consider modeling it explicitly. However, most estimates on the Landauer limit give very similar predictions as size limits. So we decided against making an explicit addition to the model and it is "implicitly" modeled in the physical size. We intend to look into Landauer's limit at some point but it's not a high priority right now. 
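For reference, the Landauer limit itself is a one-line formula: erasing one bit costs at least k_B * T * ln(2) of energy. The sketch below just evaluates that bound; the 300 K operating temperature is an assumption chosen for illustration, and nothing here reproduces the size-limit comparison from our model.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # Boltzmann constant (exact SI value)

def landauer_limit_joules(temperature_kelvin):
    """Minimum energy in joules to erase one bit at the given temperature."""
    return BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)

# Assumption: room temperature of 300 K.
energy_per_bit = landauer_limit_joules(300.0)  # roughly 2.87e-21 J

# Upper bound on irreversible bit erasures per joule at that temperature.
max_bit_erasures_per_joule = 1.0 / energy_per_bit
```

Turning this into a FLOP/s bound would additionally need an assumption about how many bit erasures one FLOP requires, which is left out here.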

We originally wanted to forecast FLOP/s/$ instead of just FLOP/s but we found it hard to make estimates about price developments. We might look into this in the future. 

Thanks. Another naive question: how do power and cooling requirements scale with transistor and GPU sizes? Could these be barriers to how large supercomputers can be built in practice?

Well, depending on who you ask, you'll get numbers between 1e13 and 1e18 for the human brain FLOP/s equivalent. So I think there is lots of uncertainty about it. 

However, I do agree that if it was at 1e16, your reasoning sounds plausible to me. What a wild imagination. 

Jacob Cannell thinks brains are doing 10^14 to 10^15 FLOP/s.

Yeah, I also expect that there are some ways of compensating for the lack of miniaturization with other tech. I don't think progress will literally come to a halt. 

We looked more into this because we wanted to get a better understanding of Ajeya's estimate of price-performance doubling every 2.5 years. Originally, some people I talked to were skeptical and thought that 2.5 years is too conservative. I now think that 2.5 years is probably insufficiently conservative in the long run. 

However, I also want to note that there are still reasons to believe a doubling time of 2 years or less could be realistic due to progress in specialization or other breakthroughs. I still have large uncertainty about the doubling tim... (read more)
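To see how much the doubling time matters, a quick sketch of the arithmetic: with doubling time T, price-performance grows by a factor of 2^(t/T) over t years. The 10-year horizon below is an arbitrary assumption for illustration.

```python
def growth_factor(years, doubling_time_years):
    """Multiplicative improvement after `years` at a fixed doubling time."""
    return 2.0 ** (years / doubling_time_years)

# Assumed horizon of 10 years, comparing the two doubling times discussed.
factor_2_5_years = growth_factor(10, 2.5)  # 2**4 = 16x
factor_2_years = growth_factor(10, 2.0)    # 2**5 = 32x
```

So over a decade, the difference between a 2.5-year and a 2-year doubling time is already a factor of 2 in price-performance, and the gap widens with longer horizons.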

By uncertainty I mean, I really don't know, i.e. I could imagine both very high and very low gains. I didn't want to express that I'm skeptical. 

For the third paragraph, I guess it depends on what you think of as specialized hardware. If you think GPUs are specialized hardware, then a gain of 1000x from CPUs to GPUs sounds very plausible to me. If you think GPUs are the baseline and specialized hardware are e.g. TPUs, then a 1000x gain sounds implausible to me. 

My original answer wasn't that clear. Does this make more sense to you?

It does, thanks! (I had interpreted the claim in the paper as comparing e.g. TPUs to CPUs, since the quote mentions CPUs as the baseline.)

As a follow-up to building the model, I was looking into specialized AI hardware and I have to say that I'm very uncertain about the claimed efficiency gains. There are some parts of the AI training pipeline that could be improved with specialized hardware but others seem to be pretty close to their limits. 

We intend to understand this better and publish a piece in the future but it's currently not high on the priority list. 

Also, when compared to CPUs, it's no wonder that any parallelized hardware is 1000x more efficient. So it really depends on what the exact comparison is that the authors used. 

Sorry, I'm a bit confused. I'm interpreting the 1st and 3rd paragraphs of your response as expressing opposite opinions about the claimed efficiency gains (uncertainty and confidence, respectively), so I think I'm probably misinterpreting part of your response?

Just send in a new application. We have a couple of new mentors but they are quite busy at the moment so I can't promise that we'll find a match soon :( Sorry for that 

I'm not sure I actually understand the distinction between forecasting and foresight. For me, most of the problems you describe sound either like forecasting questions or AI strategy questions that rely on some forecast. 

Your two arguments for why foresight is different than forecasting are 
a) some people think forecasting means only long-term predictions and some people think it means only short-term predictions. 
My understanding of forecasting is that it is not time-dependent, e.g. I can make forecasts about an hour from now or for a milli... (read more)

So, my difficulty is that my experience in government and my experience in EA-adjacent spaces has totally confused my understanding of the jargon. I'll try to clarify:
- In the context of my government experience, forecasting is explicitly trying to predict what will happen based on past data. It does not fully account for fundamental assumptions that might break due to advances in a field, changes in geopolitics, etc. Forecasts are typically used to inform one decision. It does not focus on being robust across potential futures or try to identify opportunities we can take to change the future.
- In EA / AGI Risk, it seems that people are using "forecasting" to mean something somewhat like foresight, but not really? Like, if you go on Metaculus, they are making long-term forecasts in a superforecaster-mindset, but are perhaps expecting their long-term forecasts are as good as the short-term forecasts. I don't mean to sound harsh, it's useful what they are doing and can still feed into a robust plan for different scenarios. However, I'd say what is mentioned in reports typically does lean a bit more into (what I'd consider) foresight territory sometimes.
- My hope: instead of only using "forecasts/foresight" to figure out when AGI will happen, we use it to identify risks for the community, potential yellow/red light signals, and golden opportunities where we can effectively implement policies/regulations. In my opinion, using a "strategic foresight" approach enables us to be a lot more prepared for different scenarios (and might even have identified a risk like SBF much sooner). Yes, in the end, we still need to prioritize based on the plausibility of a scenario.

Yeah, I care much less about the term/jargon than the approach. In other words, what I'm hoping to see more of is to come up with a set of scenarios and forecasting across the cone of plausibility (weighted by probability, impact, etc) so that we can c

I have not tested it since then. I think there were multiple projects that tried to improve profilers for PyTorch. I don't know how they went.

If it's easier for you, we can already facilitate that through M&M. Like we said, as long as both parties agree, you can do whatever makes sense for you :) But the program might make finding other people easier.

AI policy does count and we actively look for mentors in policy :)

I'm honestly not even sure whether this comment is in support of or against my disagreements. 

I'm skeptical of the "recursive self-improvement leads to enormous gains in intelligence over days" story but I support the "more automation leads to faster R&D leads to more automation, etc." story, which is also a form of recursive self-improvement, just over the span of years rather than days. 

2 Gerald Monroe 4mo
Recursion is an empirical fact. It works in numerous places. Computing itself, which led to AI, was recursive: the tools that allow for high-end, high-performance silicon chips require previous generations of those chips to function. (To be fair, the coupling is loose: Intel could design better chips using 5-10 year old desktops and servers, and some of their equipment is that outdated.) Electric power generator production for all currently used types requires electric power from prior generators, etc. But yeah, it so far hasn't happened in days.

I think dangerous doesn't have to imply that it can replicate itself. Doing zero-shot shenanigans on the stock market causing a recession and then being turned off also counts as dangerous in my book. 

Updated the slack link. Thanks for spotting it. 

I'm not certain about it either but I'm less skeptical. However, I agree with you that some of this could be capabilities work and has to be treated with caution. 

However, I think to answer some of the important questions around Deep Learning, e.g. which concepts they learn and under which conditions, we just need to get a better understanding of the entire pipeline. I think it's plausible that this is very hard and progress is much slower than one would hope. 

Just to be clear, I also think that your grokking work increases alignment much more than capabilities on balance. 

I think the way in which it increases capabilities would roughly look like this: "your insight on grokking is a key to understanding fast generalization better; other people build on this insight and then modify training; this improves the speed of learning and thus capabilities". 

I think your work is clearly net positive, I just wanted to use a concrete example in the post to show that there are trade-offs worth taking. 

Maybe I should have stated this differently in the post. Many conversations end up talking about X-risks at some point but usually only after it went through the other stuff. I think my main learning was just that starting with X-risk as the motivation did not seem very convincing.

Also, there is a big difference in how you talk about X-risk. You could say stuff like "there are plausible arguments why X-risk could lead to extinction but even experts are highly uncertain about this" or "We're all gonna die" and the more moderate version seems clearly more persuasive. 

I don't think these conversations had as much impact as you suggest and I think most of the stuff funded by EA funders has decent EV, i.e. I have more trust in the funding process than you seem to have.  

I think one nice side-effect of this is that I'm now widely known as "the AI safety guy" in parts of the European AIS community and some people have just randomly dropped me a message or started a conversation about it because they were curious.

I was working on different grants in the past but this particular work was not funded. 

I was thinking about this post, not the conversations at all. If this post had existed 6 years ago, your conversations probably would have had much more impact than they did. It's a really good idea for people to know how to succeed instead of fail at explaining AI safety to someone for the first time, even if this was just academics. This post should be distilled so even more people can see it, and possibly even distributed as zines or a guidebook, e.g. at some of the AI safety events such as in DC.

I think taking safety seriously was strongly correlated with whether or not they believed AI will be transformative in the near-term (as opposed to being just another hype cycle). But not sure. My sample is too small to make any general inferences. 

This seems to be what is working on if I understood their website correctly. They probably don't have a million users yet though so I agree that it is not accurate at the moment. 

Also, is an LLM with access to the internet, right? 

But yeah, probably an overstatement. 

Firstly, I don't think the term matters that much. Whether you use AGI safety,  AI safety, ML safety, etc. doesn't seem to have as much of an effect compared to the actual arguments you make during the conversation (at least that was my impression). 

Secondly, I don't say you should never talk about x-risk. I mostly say you shouldn't start with it. Many of my conversations ended up in discussions of X-risk but only after 30 minutes of back and forth. 

Not really. You can point out that this is your reasoning but whenever you talk about short timelines you can bet that most people will think you're crazy. To be fair, even most people in the alignment community are more confident that 2040 will exist than not, so this is even a controversial statement within AI safety. 

Huh, I misremembered Cotra's update, and wrt that Metaculus question that got retitled due to the AI effect, I can see most people thinking it resolves long before history ends.

Thanks for all the clarifications and the notebook. I'll definitely play around with this :) 

Yeah, our impression was that a) there is a large body of literature that is relevant and related in the existing social science literature, and b) taking 90% of the existing setup and adding AI would probably already yield lots of interesting studies. In general, it seems like there is a lot of room for people interested in the intersection of AI+ethics+social sciences.

Also, Positly+Guidedtrack makes running these studies really simple and turned out to be much smoother than I would have expected. So even when people without a social science background "just want to get a rough understanding what the rest of the world thinks" they can quickly do so with the existing tools. 

I think this is very similar to the hypothesis they have as well. But I'm not sure I understood it correctly; I think some parts of the paper are not as clear as they could be.

Update: it's published now and you can find it here:

Thanks for the post. I'm voting for the SGD inductive biases for the next one.

That's fair. I think my statement as presented is much stronger than I intended it to be. I'll add some updates/clarifications at the top of the post later.

Thanks for the feedback!

Thanks for the comment. It definitely pointed out some things that weren't clear in my post and head. Some comments:
1. I think your section on psychologizing is fairly accurate. I previously didn't spend a lot of time thinking about how my research would reduce the risks I care about and my theories of change were pretty vague. I plan to change that now. 
2. I am aware of other failure modes such as fast takeoffs, capability gains in deployment, getting what we measure, etc. However, I feel like all of these scenarios get much worse/harder when decepti... (read more)

Thanks for the clarifications. They helped me :)

I think the main difference is that it has access to all of the user's computers (or at least browsers), right? This should imply way more opportunities for malicious actions, right? 

6 Lone Pine 6mo
We need to reopen the debate on AI boxing and ask if this is something we want to oppose. I wrote about this in the basic AI safety questions thread.

Yeah, the phrasing might not be as precise as we intended it to be. 

My tentative heuristic for whether you should publish a post that is potentially infohazardy is "Has company-X-who-cares-mostly-about-capabilities likely thought about this already?". It's obviously non-trivial to answer that question but I'm pretty sure most companies who build LLMs have looked at Chinchilla and come to similar conclusions as this post. In case you're unsure, write up the post in a Google Doc and ask someone who has thought more about infohazards whether they would publish it or not. 

Also, I think Leon underestimates how fast a post can spread even if it is just intended for an alignment audience on LW. 
