
LawrenceC

I do AI Alignment research. Currently at ARC Evals, though I still dabble in grantmaking and interpretability in my spare time.

I'm also currently on leave from my PhD at UC Berkeley's CHAI.

Obligatory research billboard website: https://chanlawrence.me/

# Sequences

(Lawrence's) Reflections on Research
[Redwood Research] Causal Scrubbing


Yeah, "strongest" doesn't mean "strong" here!

LawrenceC · 1mo

I mean, yeah, as your footnote says:

> Another simpler but less illuminating way to put this is that higher serial reasoning depth can't be parallelized.[1]

Transformers do get more computation per token on longer sequences, but they also don't get more serial depth, so I'm not sure if this is actually an issue in practice?

1. ^

> [C]ompactly represent f∘g (f composed with g) in a way that makes computing it more efficient for general choices of f and g.

As an aside, I actually can't think of any class of interesting functions with this property -- when reading the paper, the closest I could think of were functions on discrete sets (lol), polynomials (though simplifying these is often more expensive than just computing the terms serially), and rational functions (ditto).

LawrenceC · 1mo

I finally got around to reading the Mamba paper. H/t Ryan Greenblatt and Vivek Hebbar for helpful comments that got me unstuck.

TL;DR: the authors propose a new deep learning architecture for sequence modeling with scaling laws that match transformers, while being much more efficient to sample from.

## A brief historical digression

As of ~2017, the three primary ways people had for doing sequence modeling were RNNs, Conv Nets, and Transformers, each with a unique “trick” for handling sequence data: recurrence, 1d convolutions, and self-attention.

• RNNs are easy to sample from — to compute the logit for x_{t+1}, you only need the most recent hidden state h_t and the last token x_t, which makes sampling both fast and memory efficient: RNNs generate a sequence of length L with O(1) memory and O(L) time. However, they’re super hard to train, because you need to sequentially generate all the hidden states and then (reverse) sequentially calculate the gradients. The way you actually do this is called backpropagation through time — you basically unroll the RNN over time — which requires constructing a graph of depth equal to the sequence length. Not only was this slow, but the depth of the graph caused vanishing/exploding gradients without careful normalization. The standard strategy was to train on short sequences and then finetune on longer ones, but in practice this meant you couldn’t train on long sequences (> a few hundred tokens) at all. The best LSTMs for modeling raw audio could only handle being trained on ~5s of speech, if you chunk the data into 25ms segments.
• Conv nets had a fixed receptive field size and pattern, so they weren’t that suited for long sequence modeling. Also, generating each token takes O(L) time, assuming the receptive field is about the same size as the sequence. But they were significantly more stable to train (the depth was small, and could be as low as O(log L)), which made training a lot easier. (Also, you could use the FFT to compute the convolutions efficiently, meaning training on one sequence takes O(L log L) time.) That being said, you still couldn’t make them that big. The most impressive example was DeepMind’s WaveNet, a conv net used to model human speech, which could handle sequences of up to 4800 samples … which was 0.3s of actual speech at 16k samples/second (note that most audio is sampled at 44k samples/second…), and even to get to that length, they had to really gimp the model’s ability to focus on particular inputs.
• Transformers are easy to train, can handle variable-length sequences, and also let the model “decide” which tokens it should pay attention to. In addition to being parallelizable and having relatively shallow computation graphs (like conv nets), you could use the RNN trick of pretraining on short sequences and then finetuning on longer sequences to save even more compute. Transformers could be trained with sequence lengths comparable to conv nets but got much better performance; for example, OpenAI’s MuseNet was trained on length-4096 sequences of MIDI files. But as we all know, transformers have the unfortunate downside of being expensive to sample from — it takes O(L) time and O(L) memory to generate a single token (!).
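To make the sampling-cost contrast above concrete, here's a toy numpy sketch (all weights and dimensions are made up for illustration, not taken from any real model): an RNN step carries only a fixed-size state, while transformer-style sampling has to keep a cache of all past keys/values, so its per-token cost and memory grow with the sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden/embedding size (arbitrary)

# RNN-style sampling: O(1) state per step (toy tanh RNN, random weights)
W_h, W_x = 0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=(d, d))
def rnn_step(h, x):
    return np.tanh(W_h @ h + W_x @ x)

# Transformer-style sampling: each new token attends over ALL past tokens,
# so the cache grows as O(L) and each step costs O(L).
def attend(q, keys, values):
    scores = keys @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

h = np.zeros(d)
cache_k, cache_v = [], []
for t in range(16):
    x = rng.normal(size=d)
    h = rnn_step(h, x)                    # state stays size d forever
    cache_k.append(x); cache_v.append(x)  # cache grows every step
    y = attend(x, np.array(cache_k), np.array(cache_v))
```

This is only meant to show where the memory goes, not to be a faithful attention implementation (no multi-head, no projections).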

The better performance of transformers over conv nets and their ability to handle variable length data let them win out.

That being said, people have been trying to get around the O(L) time and memory requirements for transformers since basically their inception. For a while, people were super into sparse or linear attention of various kinds, which could reduce the per-token compute/memory requirements to O(log(L)) or O(1).

## The what and why of Mamba

If the input -> hidden and hidden -> hidden maps of an RNN were linear (h_{t+1} = A h_t + B x_t), then it’d be possible to train on an entire sequence in parallel — this is because you can just … compose the transformation with itself (computing A^k for k in 1…L-1) a bunch, and effectively unroll the graph with the convolutional kernel defined by B, A B, A^2 B, … A^{L-1} B. Not only can you use the FFT during training to get the O(L log L) time of a conv net forward/backward pass (as opposed to O(L^2) for the transformer), you also keep the O(1) sampling time/memory of the RNN!
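A quick numerical sanity check of this equivalence (the matrices and dimensions below are arbitrary toys, not anything from the Mamba paper): running the linear recurrence step by step gives the same final state as convolving the inputs with the kernel (B, AB, A²B, …, A^{L-1}B).

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 4, 6                          # hidden size and sequence length (toy)
A = 0.3 * rng.normal(size=(d, d))    # hidden -> hidden map
B = rng.normal(size=(d,))            # input -> hidden map (scalar inputs)
x = rng.normal(size=(L,))            # input sequence

# 1) Sequential recurrence: h_{t+1} = A h_t + B x_t, starting from h_0 = 0
h = np.zeros(d)
for t in range(L):
    h = A @ h + B * x[t]
h_sequential = h

# 2) Unrolled form: convolve inputs with the kernel (B, AB, A^2 B, ...)
kernel = [np.linalg.matrix_power(A, j) @ B for j in range(L)]
h_conv = sum(kernel[j] * x[L - 1 - j] for j in range(L))

assert np.allclose(h_sequential, h_conv)
```

Since the kernel is fixed ahead of time, the whole sequence can be processed with one (FFT-accelerated) convolution during training, while sampling still only needs the recurrence.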

The problem is that linear hidden state dynamics are kinda boring. For example, you can’t even learn to update your existing hidden state in a different way if you see particular tokens! And indeed, previous results gave scaling laws that were much worse than transformers in terms of performance/training compute.

In Mamba, you basically learn time-varying A and B. The parameterization is a bit wonky here, for historical reasons, but it goes something like: A_t = exp(−δ(x_t) · exp(A)) and B_t = δ(x_t) B, where δ(x_t) = softplus(W_δ x_t), with the update h_{t+1} = A_t h_t + B_t x_t. Also note that in Mamba, they constrain A to be diagonal and W_δ to be low rank, for computational reasons.

Since exp(A) is diagonal and has only positive entries, we can interpret the model as follows: δ controls how much to “learn” from the current token — with high δ, A_t approaches 0 and B_t is large, so h_{t+1} ≈ B_t x_t, while as δ approaches 0, A_t approaches 1 and B_t approaches 0, so h_{t+1} ≈ h_t.
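A scalar toy illustrating this gating (the specific numbers and the w_delta weight are mine, chosen for illustration, not values from the paper): a large δ overwrites the state with the current input, while a tiny δ leaves the state almost untouched.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

A_log = 0.0  # scalar "A" parameter, so exp(A) = 1 (toy choice)

def step(h, x, w_delta=5.0, B=1.0):
    # Selective update: delta depends on the current input x
    delta = softplus(w_delta * x)
    A_t = np.exp(-delta * np.exp(A_log))
    B_t = delta * B
    return A_t * h + B_t * x

# Input the model "wants" to store: delta is large, so A_t ~ 0 and the
# new state is dominated by B_t * x (here roughly 20).
h_write = step(h=1.0, x=2.0)

# Input the model "wants" to ignore: delta ~ 0, so A_t ~ 1, B_t ~ 0,
# and the state barely moves from 1.0.
h_keep = step(h=1.0, x=-2.0)
```

So the same recurrence can either act as a near-perfect memory or a near-total reset, depending on the token — exactly what the fixed linear RNN could not do.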

Now, you can’t exactly unroll the hidden state as a convolution with a predefined convolution kernel anymore, but you can still efficiently compute the implied “convolution” using parallel scanning.
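A sketch of why the time-varying recurrence is still scannable: composing two affine updates h ↦ a·h + b gives another affine update, and this composition is associative, so the per-step updates can be combined in any bracketing. The code below uses a naive prefix reduce for clarity; a real implementation would run the same `combine` in a log-depth parallel scan.

```python
import numpy as np
from functools import reduce

def combine(f, g):
    # Compose two affine maps h -> a*h + b (apply f first, then g):
    # g(f(h)) = a2*(a1*h + b1) + b2 = (a1*a2)*h + (a2*b1 + b2)
    a1, b1 = f
    a2, b2 = g
    return (a1 * a2, a2 * b1 + b2)

rng = np.random.default_rng(0)
L = 10
a = rng.uniform(0.5, 1.0, size=L)  # time-varying decay (stand-in for A_t)
b = rng.normal(size=L)             # time-varying input term (stand-in for B_t x_t)

# Sequential recurrence h_{t+1} = a_t h_t + b_t, starting from h_0 = 0
h, hs_seq = 0.0, []
for a_t, b_t in zip(a, b):
    h = a_t * h + b_t
    hs_seq.append(h)

# Same hidden states via prefix reductions of the associative combine
steps = list(zip(a, b))
hs_scan = [reduce(combine, steps[: t + 1])[1] for t in range(L)]

assert np.allclose(hs_seq, hs_scan)
```

Because `combine` is associative, the prefix reductions can be computed in O(log L) depth on parallel hardware, which is what makes training the selective model efficient.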

Despite being much cheaper to sample from, Mamba matches the pretraining flops efficiency of modern transformers (Transformer++ = the current SOTA open source Transformer with RMSNorm, a better learning rate schedule, and corrected AdamW hyperparameters, etc.). And on a toy induction task, it generalizes to much longer sequences than it was trained on.

## So, about those capability externalities from mech interp...

Yes, those are the same induction heads from the Anthropic ICL paper!

Like the previous HiPPO and Hyena papers, they cite mech interp as one of their inspirations, in that it inspired them to think about what the linear hidden state model could not model and how to fix that. I still don’t think mech interp has that much Shapley value here (the idea of studying how models perform toy tasks is not new, and the authors don't even use the induction metric or RRT task from the Olsson et al. paper), but I'm not super sure about this.

IMO, this line of work is the strongest argument for mech interp (or maybe interp in general) having concrete capabilities externalities. In addition, I think the previous argument Neel and I gave of "these advances are extremely unlikely to improve frontier models" feels substantially weaker now.

## Is this a big deal?

I don't know, tbh.

That seems correct, at least directionally, yes.

I don't want to say things that have any chance of annoying METR without checking with METR comm people, and I don't think it's worth their time to check the things I wanted to say.

I'm not sure your results really support the interpretation that davinci "transfers less well". Notably, going from 50% accuracy to 100% is often a lot harder than going from 0% (or whatever random chance is on your datasets) to 50% (I haven't looked through your code yet to examine the datasets), and I'd predict that davinci already does pretty well zero-shot (with no finetuning) on most of the tasks you consider here — which limits its improvement from finetuning, as you can't get above 100% accuracy.

In addition, larger LMs are often significantly more data efficient, so you'd predict that they need less total finetuning to do well on tasks (and therefore the additional finetuning on related tasks would benefit the larger models less).

LawrenceC · 1mo

# How my views on AI(S) have changed over the last 5.5 years

This was ~~shamelessly copied from~~ directly inspired by Erik Jenner's "How my views on AI have changed over the last 1.5 years". I think my views when I started my PhD in Fall 2018 look a lot worse than Erik's when he started his PhD, though in large part due to starting my PhD in 2018 and not 2022.

Apologies for the disorganized bullet points. If I had more time I would've written a shorter shortform.

## AI Capabilities/Development

Summary: I used to believe in a 2018-era MIRI worldview for AGI, and now I have updated toward slower takeoff, fewer insights, and shorter timelines.

• In Fall of 2018, my model of how AGI might happen was substantially influenced by AlphaGo/Zero, which features explicit internal search. I expected future AIs to also feature explicit internal search over world models, and to be trained mainly via reinforcement learning or IDA. I became more uncertain after OpenAI 5 (~May 2018), which used no clever techniques and just featured BPTT being run on large LSTMs.
• That being said, I did not believe in the scaling hypothesis -- that is, that simply training larger models on more inputs would continually improve performance until we see "intelligent behavior" -- until GPT-2 (2019), despite encountering it significantly earlier (e.g. with OpenAI 5, or speaking to OAI people).
• In particular, I believed that we needed many "key insights" about intelligence before we could make AGI. This both gave me longer timelines and also made me believe more in fast take-off.
• I used to believe pretty strongly in MIRI-style fast take-off (e.g. I would've assigned <30% credence that we see a 4-year period with the economy doubling) as opposed to (what was called at the time) Paul-style slow take-off. Given the way the world has turned out, I have updated substantially. While I don't think that AI development will be particularly smooth, I do expect it to be somewhat incremental, and I also expect earlier AIs to provide significantly more value even before truly transformative AI.
• -- Some beliefs about AI Scaling Labs that I'm redacting on LW --
• My timelines are significantly shorter -- I would've probably said median 2050-60 in 2018, but now I think we will probably reach human-level AI by 2035.

## AI X-Risk

Summary: I have become more optimistic about AI X-risk, but my understanding has become more nuanced.

• My P(Doom) has substantially decreased, especially P(Doom) attributable to an AI directly killing all of humanity. This is somewhat due to having more faith that many people will be reasonable (in 2018, there were maybe ~20 FTE AIS researchers; now there are probably something like 300-1000 depending on how you count), somewhat due to believing that governance efforts may successfully slow down AGI substantially, and somewhat due to an increased belief that "winging it"-style, "unprincipled" solutions can scale to powerful AIs.
• That being said, I'm less sure about what P(Doom) means. In 2018, I imagined the main outcomes were either "unaligned AGI instantly defeats all of humanity" and "a pure post-scarcity utopia". I now believe in a much wider variety of outcomes.
• For example, I've become more convinced both that misuse risk is larger than I thought, and that even weirder outcomes are possible (e.g. the AI keeps human (brain scans) around due to trade reasons). The former is in large part related to my belief in fast take-off being somewhat contradicted by world events; now there is more time for powerful AIs to be misused.
• I used to think that solving the technical problem of AI alignment would be necessary/sufficient to prevent AI x-risk. I now think that we're unlikely to "solve alignment" in a way that leads to the ability to deploy a powerful Sovereign AI (without AI assistance), and also that governance solutions both can be helpful and are required.

## AI Safety Research

Summary: I've updated slightly downwards on the value of conceptual work and significantly upwards on the value of fast empirical feedback cycles. I've become more bullish on (mech) interp, automated alignment research, and behavioral capability evaluations.

• In Fall 2018, I used to think that IRL for ambitious value learning was one of the most important problems to work on. I no longer think so, and think that most of my work on this topic was basically useless.
• In terms of more prosaic IRL problems, I very much lived in a frame of "the reward models are too dumb to understand" (a standard academic take). I didn't think much about issues of ontology identification or (malign) partial observability.
• I thought that academic ML theory had a decent chance of being useful for alignment. I think it's basically been pretty useless in the past 5.5 years, and I no longer think the chances of it being helpful "in time" are high enough. It's not clear how much of this is because the AIS community did not really know about the academic ML theory work, but man, the bounds turned out to be pretty vacuous, and empirical work turned out far more informative than pure theory work.
• I still think that conceptual work is undervalued in ML, but my prototypical good conceptual work looks less like "prove really hard theorems" or "think about philosophy" and a lot more like "do lots of cheap and quick experiments/proof sketches to get grounding".
• Relatedly, I used to dismiss simple techniques for AI Alignment that try "the obvious thing". While I don't think these techniques will scale (or even necessarily work well on current AIs), this strategy has turned out to be significantly better in practice than I thought.
• My error bars around the value of reading academic literature have shrunk significantly (in large part due to reading a lot of it). I've updated significantly upwards on "the academic literature will probably contain some relevant insights" and downwards on "the missing component of all of AGI safety can be found in a paper from 1983".
• I used to think that interpretability of deep neural networks was probably infeasible to achieve "in time" if not "actually impossible" (especially mechanistic interpretability). Now I'm pretty uncertain about its feasibility.
• Similarly, I used to think that having AIs automate substantial amounts of alignment research was not possible. Now I think that most plans with a shot of successfully preventing AGI x-risk will feature substantial amounts of AI.
• I used to think that behavioral evaluations in general would be basically useless for AGIs. I now think that dangerous capability evaluations can serve as an important governance tool.

Summary: I've better identified my comparative advantages, and have a healthier way of relating to AIS research.

• I used to think that my comparative advantage was clearly going to be in doing the actual technical thinking or theorem proving. In fact, I used to believe that I was unsuited for both technical writing and pushing projects over the finish line. Now I think that most of my value in the past ~2 years has come from technical writing or by helping finish projects.
• I used to think that pure engineering or mathematical skill were what mattered, and feel sad about how it seemed that my comparative advantage was something akin to long term memory.[1] I now see more value in having good long-term memory.
• I used to be uncertain about if academia was a good place for me to do research. Now I'm pretty confident it's not.
• Embarrassingly enough, in 2018 I used to implicitly believe quite strongly in a binary model of "you're good enough to do research" vs "you're not good enough to do research". In addition, I had an implicit model that the only people "good enough" were those who never failed at any evaluation. I no longer think this is true.
• I am more of a fan of trying obvious approaches or "just doing the thing".

1. ^

I think, compared to the people around me, I don't actually have that much "raw compute" or even short term memory (e.g. I do pretty poorly on IQ tests or novel math puzzles), and am able to perform at a much higher level by pattern matching and amortizing thinking using good long-term memory (if not outsourcing it entirely by quoting other people's writing).

LawrenceC · 1mo

Right, the step I was missing was that P(X|Y=y) = P(X|Z=z) for all y, z implies P(X|Z=z) = P(X). Thanks!

LawrenceC · 1mo

Hm, it sounds like you're claiming that if each pair of x, y, z are pairwise independent conditioned on the third variable, and p(x, y, z) =/= 0 for all x, y, z with nonzero p(x), p(y), p(z), then ?

I tried for a bit to show this but couldn't prove it, let alone the general case without strong invariance. My guess is I'm probably missing something really obvious.

I agree that GSM8K has been pretty saturated (for the best frontier models) since ~GPT-4, and GPQA is designed to be a hard-to-saturate benchmark (though given the pace of progress...).

But why are HumanEval and MMLU also considered saturated? E.g. Opus and 4-Turbo are both significantly better than all other publicly known models on both benchmarks. And at least for HumanEval, I don't see why >95% accuracy isn't feasible.

It seems plausible that MMLU/HumanEval could be saturated after GPT-4.5 or Gemini 1.5 Ultra, at least for the best frontier models. And it seems fairly likely we'll see them saturated in 2-3 years. But it seems like a stretch to call them saturated right now.

Is the reasoning for this that Opus gets only 0.4% better on MMLU than the March GPT-4? That seems like pretty invalid reasoning, akin to deducing that because two runners achieved the same time, that time must be the best human-achievable time. And this doesn't apply to HumanEval, where Opus is ~18% better than March GPT-4 and the November 4-Turbo is 2.9% better than Opus.