What happens when a learned model (such as a neural network) is itself an optimizer? The possibility of mesa-optimization raises two important questions for the safety and transparency of advanced ML systems. First, under what circumstances will models be optimizers, including when they should not be? And, when a model is an optimizer, what will its objective be?

LawrenceC
I finally got around to reading the Mamba paper. H/t Ryan Greenblatt and Vivek Hebbar for helpful comments that got me unstuck.

TL;DR: the authors propose a new deep learning architecture for sequence modeling with scaling laws that match transformers while being much more efficient to sample from.

A brief historical digression

As of ~2017, the three primary ways people had for doing sequence modeling were RNNs, conv nets, and transformers, each with a unique "trick" for handling sequence data: recurrence, 1d convolutions, and self-attention.

* RNNs are easy to sample from — to compute the logit for x_{t+1}, you only need to store the most recent hidden state h_t and the last token x_t, which means it's both fast and memory efficient. RNNs generate a sequence of length L with O(1) memory and O(L) time (a toy sketch below illustrates this). However, they're super hard to train, because you need to sequentially generate all the hidden states and then sequentially compute the gradients in reverse. The way you actually do this is called backpropagation through time — you basically unroll the RNN over time — which requires constructing a graph of depth equal to the sequence length. Not only was this slow, but the graph being so deep caused vanishing/exploding gradients without careful normalization. The strategy people used was to train on short sequences and finetune on longer ones, but in practice this meant you couldn't train on long sequences (more than a few hundred tokens) at all. The best LSTMs for modeling raw audio could only handle being trained on ~5s of speech, if you chunk up the data into 25ms segments.
* Conv nets had a fixed receptive field size and pattern, so they weren't that suited for long sequence modeling. Also, generating each token takes O(L) time, assuming the receptive field is about the same size as the sequence. But they had significantly more stability (the depth was small, and could be as low as O(log(L))), which meant you could train them a lot more easily. (Also, you could use the FFT to efficiently compute the convolutions, meaning training on one sequence takes O(L log(L)) time.) That being said, you still couldn't make them that big. The most impressive example was DeepMind's WaveNet, a conv net used to model human speech, which could handle sequences of up to 4800 samples … which was 0.3s of actual speech at 16,000 samples/second (note that most audio is sampled at 44.1k samples/second), and even to get to that amount, they had to really gimp the model's ability to focus on particular inputs.
* Transformers are easy to train, can handle variable-length sequences, and also allow the model to "decide" which tokens it should pay attention to. In addition to being parallelizable and having relatively shallow computation graphs (like conv nets), you could do the RNN trick of pretraining on short sequences and then finetuning on longer ones to save even more compute. Transformers could be trained with comparable sequence lengths to conv nets but get much better performance; for example, OpenAI's MuseNet was trained on length-4096 sequences of MIDI files. But as we all know, transformers have the unfortunate downside of being expensive to sample from — it takes O(L) time and O(L) memory to generate a single token (!).

The better performance of transformers over conv nets and their ability to handle variable-length data let them win out. That being said, people have been trying to get around the O(L) time and memory requirements for transformers since basically their inception.
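As a concrete picture of why RNN sampling is cheap, here's a minimal toy sketch (my own illustration, not from the paper; the weights are random placeholders):

```python
import numpy as np

# Toy RNN sampler: memory stays O(1) in sequence length because each step
# needs only the fixed-size hidden state and the last token.
rng = np.random.default_rng(0)
d_hidden, n_vocab = 16, 100
W_h = rng.normal(0, 0.3, (d_hidden, d_hidden))   # hidden -> hidden
W_x = rng.normal(0, 0.3, (d_hidden, n_vocab))    # token -> hidden
W_o = rng.normal(0, 0.3, (n_vocab, d_hidden))    # hidden -> logits

def one_hot(i):
    v = np.zeros(n_vocab)
    v[i] = 1.0
    return v

h, token = np.zeros(d_hidden), 0
for _ in range(50):                              # O(L) time overall
    h = np.tanh(W_h @ h + W_x @ one_hot(token))  # update the O(1)-sized state
    logits = W_o @ h
    p = np.exp(logits - logits.max())
    p /= p.sum()
    token = rng.choice(n_vocab, p=p)             # sample the next token
```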
For a while, people were super into sparse or linear attention of various kinds, which could reduce the per-token compute/memory requirements to O(log(L)) or O(1).

The what and why of Mamba

If the input -> hidden and hidden -> hidden maps for an RNN were linear (h_{t+1} = A h_t + B x_t), then it'd be possible to train on an entire sequence in parallel — this is because you can just compose the transformation with itself (computing A^k for k in 2…L-1) and effectively unroll the graph with the convolutional kernel defined by B, AB, A^2 B, …, A^{L-1} B. Not only can you use the FFT during training to get the O(L log(L)) time of a conv net forward/backward pass (as opposed to O(L^2) for the transformer), you still keep the O(1) sampling time/memory of the RNN!

The problem is that linear hidden state dynamics are kinda boring. For example, you can't even learn to update your existing hidden state in a different way if you see particular tokens! And indeed, previous results gave scaling laws that were much worse than transformers in terms of performance per unit of training compute.

In Mamba, you basically learn time-varying A and B. The parameterization is a bit wonky here, because of historical reasons, but it goes something like: A_t = exp(-δ(x_t) * exp(A)), B_t = δ(x_t) B x_t, where δ(x_t) = softplus(W_δ x_t). Note that Mamba also constrains A to be diagonal and W_δ to be low rank, for computational reasons. Since exp(A) is diagonal and has only positive entries, we can interpret the model as follows: δ controls how much to "learn" from the current example — with high δ, A_t approaches 0 and B_t is large, so h_{t+1} ≈ B_t, i.e. the state is overwritten by the current input, while with δ approaching 0, A_t approaches 1 and B_t approaches 0, meaning h_{t+1} ≈ h_t. (A rough sketch of this recurrence is below.) Now, you can't exactly unroll the hidden state as a convolution with a predefined convolution kernel anymore, but you can still efficiently compute the implied "convolution" using parallel scanning.

Despite being much cheaper to sample from, Mamba matches the pretraining flops efficiency of modern transformers (Transformer++ = the current SOTA open-source transformer, with RMSNorm, a better learning rate schedule, corrected AdamW hyperparameters, etc.). And on a toy induction task, it generalizes to much longer sequences than it was trained on.

So, about those capability externalities from mech interp...

Yes, those are the same induction heads from the Anthropic ICL paper! Like the previous HiPPO and Hyena papers, they cite mech interp as one of their inspirations, in that it inspired them to think about what the linear hidden state model could not model and how to fix that. I still don't think mech interp has that much Shapley here (the idea of studying how models perform toy tasks is not new, and the authors don't even use the induction metric or the RRT task from the Olsson et al. paper), but I'm not super sure on this. IMO, this line of work is the strongest argument for mech interp (or maybe interp in general) having concrete capabilities externalities. In addition, I think the previous argument Neel and I gave of "these advances are extremely unlikely to improve frontier models" feels substantially weaker now.

Is this a big deal? I don't know, tbh.
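Here's a rough numpy sketch of that gated recurrence (my own simplification: the shapes, the scalar gate, and the sequential loop are assumptions for clarity; real Mamba uses per-channel deltas, a different discretization, and a parallel scan):

```python
import numpy as np

# Sketch of the selective recurrence h_{t+1} = A_t h_t + B_t, where A_t and
# B_t depend on the current input x_t (B_t already contains x_t, as above).
def selective_ssm(x, A_log, B, C, w_delta):
    L, d_in = x.shape
    d_state = A_log.shape[0]
    h = np.zeros(d_state)
    ys = []
    for t in range(L):
        delta = np.log1p(np.exp(w_delta @ x[t]))  # softplus(W_delta x_t) > 0
        A_t = np.exp(-delta * np.exp(A_log))      # diagonal, entries in (0, 1)
        B_t = delta * (B @ x[t])                  # input write, scaled by the gate
        h = A_t * h + B_t                         # elementwise since A_t is diagonal
        ys.append(C @ h)
    return np.stack(ys)

# delta large -> A_t ~ 0, B_t large: state is overwritten by the current input
# delta ~ 0   -> A_t ~ 1, B_t ~ 0:   state is carried through unchanged
x = np.random.default_rng(0).normal(size=(12, 4))
y = selective_ssm(x, A_log=np.zeros(8), B=0.1 * np.ones((8, 4)),
                  C=0.1 * np.ones((3, 8)), w_delta=0.5 * np.ones(4))
```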
gwern
Warning for anyone who has ever interacted with "robosucka" or been solicited for a new podcast series in the past few years: https://www.tumblr.com/rationalists-out-of-context/744970106867744768/heads-up-to-anyone-whos-spoken-to-this-person-i
Robin Hanson has been writing regularly, at about the same quality, for almost 20 years. Tyler Cowen too, but Robin has personally been much more influential on me intellectually. It's actually really surprising how little his insights have degraded via regression-to-the-mean effects. Anyone else like this?
The idea that maximally-coherent agents look like squiggle-maximizers raises the question: what would it look like for humans to become maximally coherent?

One answer, which Yudkowsky gives here, is that conscious experiences are just a "weird and more abstract and complicated pattern that matter can be squiggled into". But that seems to be in tension with another claim he makes: that there's no way for one agent's conscious experiences to become "more real" except at the expense of other conscious agents—a claim which, according to him, motivates average utilitarianism across the multiverse. Clearly a squiggle-maximizer would not be an average squigglean. So what's the disanalogy here?

It seems like @Eliezer Yudkowsky is basically using SSA, but comparing between possible multiverses—i.e. when facing the choice between creating agent A or not, you look at the set of As in the multiverse where you decided yes, compare it to the set of As in the multiverse where you decided no, and (if you're deciding for the good of A) pick whichever one gives A a better time on average.

Yudkowsky has written before (can't find the link) that he takes this approach because alternatives would entail giving up on predictions about his future experiences—e.g. constantly predicting he's a Boltzmann brain and will dissolve in the next second. But this argument by Wei Dai shows that agents which reason in this way can be money-pumped by creating arbitrarily short-lived copies of them. Based on this I claim that Yudkowsky's preferences are incoherent, and that the only coherent thing to do here is to "expect to be" a given copy in proportion to the resources it will have available, as anthropic decision theory claims. (Incidentally, this also explains why we're at the hinge of history.)

But this is just an answer, it doesn't dissolve the problem. What could? Some wild guesses:

1. You are allowed to have preferences about the external world, and you are allowed to have preferences about your "thread of experience"—you're just not allowed to have both. The incoherence comes from trying to combine the two; the coherent thing to do would be to put them into different agents, who will then end up in very different parts of the multiverse.
2. Another way of framing this: you are allowed to be a decision-maker, and you are allowed to be a repository of welfare, but you're not allowed to be both (on pain of incoherence/being dutch-booked).
3. Something totally different: the problem here is that we don't have intuitive experience of being agents which can copy themselves, shut down copies, re-merge, etc. If we did, then maybe SSA would seem as silly as expecting to end up in a different universe whenever we went to sleep.
4. Actually, maybe the operative thing we lack experience with is not just splitting into different subagents, but rather merging together afterwards. What does it feel like to have been thousands of different parallel agents, and now be a single agent with their unified experiences? What sort of identity would one construct in that situation? Maybe this is an important part of dissolving the problem.

Popular Comments

Recent Discussion

james.lucassen
I'm confused about EDT and Smoking Lesion. The canonical argument says: 1) CDT will smoke, because smoking can't cause you to have the lesion or have cancer; 2) EDT will not smoke, because people who smoke tend to have the lesion, and tend to have cancer.

I'm confused about 2), and specifically about "people who smoke tend to have the lesion". Say I live in a country where everybody follows EDT. Then nobody smokes, and there is no correlation between smoking and having the lesion. Seems like "people who smoke tend to have the lesion" is pumping a misleading intuition. Maybe mortals who make probabilistic decisions and smoke are then more likely to have the lesion. But it's not true that EDT'ers who smoke tend to have the lesion, because EDT'ers never smoke. Seems like EDT contains a conflict between "model your actions as probabilistic and so informative about lesion status" versus "model your actions as deterministic because you perfectly follow EDT and so not informative about lesion status".

Aha, wait, I think I un-confused myself in the process of writing this. If we suppose that having the lesion guarantees you'll smoke, then having the lesion must force you to not be an EDT'er, because EDT'ers never smoke. So the argument kind of has to be phrased in terms of "people who smoke" instead of "EDT'ers who smoke", which sounds misleading at first, like it's tricking you with the wrong reference class for the correlation. But actually it's essential, because "EDT'ers who smoke" is a contradiction, and "people who smoke" roughly translates to "decision theories which smoke". With a deterministic decision theory you can't take action counterfactuals without a contradiction, so you have to counterfact on your whole decision theory instead. Neat :)
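To make the canonical argument concrete, here's a toy expected-utility calculation (the probabilities and utilities are made up purely for illustration):

```python
# Smoking Lesion with illustrative numbers: the lesion causes both smoking
# and cancer; smoking itself is harmless and worth +1, cancer costs -10.
p_lesion = 0.2
p_smoke_given_lesion = 0.9
p_smoke_given_no_lesion = 0.1
p_cancer_given_lesion = 0.8  # independent of smoking, given lesion status

def edt_expected_utility(smoke: bool) -> float:
    # EDT conditions on the action as evidence about the lesion.
    p_a_lesion = p_smoke_given_lesion if smoke else 1 - p_smoke_given_lesion
    p_a_clear = p_smoke_given_no_lesion if smoke else 1 - p_smoke_given_no_lesion
    p_action = p_lesion * p_a_lesion + (1 - p_lesion) * p_a_clear
    p_lesion_given_action = p_lesion * p_a_lesion / p_action  # Bayes
    return (1.0 if smoke else 0.0) - 10 * p_lesion_given_action * p_cancer_given_lesion

print(edt_expected_utility(True))   # ~ -4.54: smoking "looks" bad
print(edt_expected_utility(False))  # ~ -0.22: so EDT abstains
# The correlation doing the work here is exactly the one that evaporates in
# a population of pure EDT agents, per the argument above.
```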

I've long been confused about the Smoking Lesion problem for a completely different and probably unimportant reason. I don't understand the combination of "should you prefer" and "preference is caused by lesion". What does "should" mean in this case? Preferences over counterfactual preferences are just weird.

quetzal_rainbow
My personal model is: if you have the lesion, with some small probability it takes over your mind and you smoke anyway; also, you can't distinguish whether your decision is due to the lesion.

It seems like logically updateless reasoning is what we would want in order to solve many decision-theory problems. I show that several of the problems which seem to require updateless reasoning can instead be solved by selecting a policy with a logical inductor that's run a small amount of time. The policy specifies how to make use of knowledge from a logical inductor which is run longer. This addresses the difficulties which seem to block logically updateless decision theory in a fairly direct manner. On the other hand, it doesn't seem to hold much promise for the kind of insights which we would want from a real solution.


LI Policy Selection

Rather than running a logical inductor all the way to P_f(n) and making a decision via the expected

...
Richard_Ngo
Is there a particular reason to think that the answer to this shouldn't just be "first run a logical inductor to P_f(f(n)), then use that distribution to determine how to use P_f(n) to determine how to choose an action from P_n" (at least for large enough n)?

It seems better in principle to find a way to respect human intuitions about which things to be updateless about. Getting something wrong in the too-updateless direction can give up control of the AI to entities which we don't think of as existing; getting something wrong in the too-updateful direction can miss out on multiverse-wide coordination via superrationality.

This is a linkpost for https://dynomight.net/axes/

Say you want to plot some data. You could just plot it by itself:

Or you could put lines on the left and bottom:

Or you could put lines everywhere:

Or you could be weird:

Which is right? Many people treat this as an aesthetic choice. But I’d like to suggest an unambiguous rule.

Principles

First, try to accept that all axis lines are optional. I promise that readers will recognize a plot even without lines around it.
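(In matplotlib terms, a minimal sketch with made-up data, since axis lines there are just "spines" you can toggle:)

```python
import matplotlib.pyplot as plt
import numpy as np

# The same data drawn three ways: no axis lines, left/bottom only, full box.
x = np.linspace(0, 10, 50)
y = np.sin(x)

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
configs = [
    [],                                  # no axis lines at all
    ["left", "bottom"],                  # lines on the left and bottom
    ["left", "bottom", "top", "right"],  # lines everywhere
]
for ax, visible in zip(axes, configs):
    ax.plot(x, y)
    for side in ["left", "bottom", "top", "right"]:
        ax.spines[side].set_visible(side in visible)
plt.show()
```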

So consider these plots:

Which is better? I claim this depends on what you’re plotting. To answer, mentally picture these arrows:

Now, ask yourself, are the lengths of these arrows meaningful? When you draw that horizontal line, you invite people to compare those lengths.

You use the same principle for deciding if you should draw a y-axis line. As...

Gordon Seidoh Worley
Eh, I've encountered plenty of cases where I really needed to understand the variance of the data, such that I had to "zoom in" and start the axis above 0, because otherwise I couldn't learn what I needed to know to make a decision. But I do often like to see it both ways, so I can understand it in both relative and absolute terms.

Just so; the correct way is indeed to show the full (zero-based y-axis) chart, then a “zoomed-in” version, with the y-axis mapping clearly indicated. Of course, this takes more effort than just including the one chart; but this is not surprising—doing things correctly often takes more effort than doing things incorrectly!
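For instance (a sketch with hypothetical numbers):

```python
import matplotlib.pyplot as plt

# Full zero-based view next to a zoomed view, with the zoom range explicit.
months = range(12)
values = [100, 101, 103, 102, 104, 106, 105, 107, 109, 108, 111, 112]

fig, (full, zoom) = plt.subplots(1, 2, figsize=(8, 3))
full.plot(months, values)
full.set_ylim(0, 120)
full.set_title("Full (zero-based y-axis)")
zoom.plot(months, values)
zoom.set_ylim(98, 114)
zoom.set_title("Zoomed (y from 98 to 114)")
plt.show()
```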

Rocket
This is the difference between a ratio scale and a cardinal scale. In a cardinal scale, the distance between points is meaningful, e.g., "the gap between 2 and 4 is twice as big as the gap between 1 and 2." In a ratio scale, there is additionally a well-defined zero point, which means the ratios of points are also meaningful, e.g., "4 is twice as large as 2."

A while ago I wrote about how I managed to add 13 points to my IQ (as measured by the mean of 4 different tests).

I had 3 "self-experimenters" follow my instructions in San Francisco. One of them dropped out, since, surprise surprise, the intervention is hard.

The other two had increases of 11 and 10 IQ points respectively (using the "fluid" components of each test), and increases of 9 and 7 respectively if we include verbal IQ.

A total of 7 people acted as controls, and were given advantages on the test relative to the intervention group in order to exacerbate the effects of memory and motivation; only 1 scored on par with the intervention group. We get a very good p-value, considering the small n, both when...

What evidence do you have about how much time it takes per day to maintain the effect after the end of the 2 weeks?

Stephen Fowler
This is your second post and you're still being vague about the method. I'm updating strongly towards this being a hoax, and I'm surprised people are taking you seriously.

Edit: I'll offer you a 50 USD even-money bet that your method won't replicate when tested by a 3rd party with more subjects and a proper control group.
frankybegs
Sorry if I've missed something about this elsewhere, but is it possible to explain what the method involves, even to people who aren't going to do it properly? I don't have 4+ hours a day to spare at the moment, nor $10k, but I'd love to know what the intervention involves so I can adopt as much of it as is feasible (given that it sounds like a multi-pronged intervention). Unless there's reason to think it only works as an all-or-nothing? Even just the supplements on their own sound like they might be worth trying, otherwise.
frankybegs
Apologies, I just read your reply to Joseph C. I would like to request the information, your reservations notwithstanding. I am happy to sign a liability waiver, or anything of that nature that would make you feel comfortable. I am also happy to share as much data as it is feasible to collect, and believe I could recruit at least some controls. As I mention above, I don't think I'll be able to implement the intervention in its entirety, given practical and resource constraints, but given your stated interest in a '1000 ships' approach this seems like it could be a positive for you.

If it’s worth saying, but not worth its own post, here's a place to put it.

If you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are invited. This is also the place to discuss feature requests and other ideas you have for the site, if you don't want to write a full top-level post.

If you're new to the community, you can start reading the Highlights from the Sequences, a collection of posts about the core ideas of LessWrong.

If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ. If you want to orient to the content on the site, you can also check out the Concepts section.

The Open Thread tag is here. The Open Thread sequence is here.

You might find this link helpful for your questions. 

This is a link to the glossary from the above site.

This is from the FRB of St. Louis.

Last, I would suggest you can also just ask any of the available LLMs out there now to explain the term you are interested in and get a pretty good initial explanation.

As for books, I have three. How good they are is subjective, as one textbook is from years ago, but they should cover most of the investment-markets side of things:

Options as a Strategic Investment (Lawrence McMIllan)

Technical Analysis (Kirkpatrick &... (read more)

CstineSublime
I don't see many particularly exotic finance terms, but I would think a recent edition of Brigham and Ehrhardt's "Financial Management: Theory and Practice" is probably the most authoritative (though, it not being available for 'preview' on Google Books, I haven't been able to do a quick ctrl+f to check it against a sample of terms from that post). However, I suspect that even one of those Tony Robbins books on investing will provide the terminology, or even Investopedia.

I did an exploration into how Community Notes (formerly Birdwatch) from X (formerly Twitter) works, and how its algorithm decides which notes get displayed to the wider community. In this post, I’ll share and explain what I found, as well as offer some comments. 
 

Community Notes is a fact-checking tool available to US-based users of X/Twitter which allows readers to attach notes to posts to give them clarifying context. It uses an open-source bridging-based ranking algorithm intended to promote notes which receive cross-partisan support, and demote notes with a strong partisan lean. The tool seems to be pretty popular overall, and most of the criticism aimed toward it seems to be about how Community Notes fails to be a sufficient replacement for other, more top-down moderation systems.[1] 


This seems interesting to me as an...
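The core of the bridging-based ranking mentioned above is a matrix factorization; here's my own simplified sketch of the idea (not a faithful port of the open-source implementation, and the hyperparameters are guesses):

```python
import numpy as np

# Fit rating ~ mu + user_bias + note_bias + user_vec . note_vec by SGD, then
# rank notes by note_bias: the "helpfulness" left over once partisan lean is
# absorbed by the one-dimensional factor term.
def note_scores(ratings, n_users, n_notes, dim=1, lr=0.05, reg=0.1, epochs=200):
    rng = np.random.default_rng(0)
    mu = 0.0
    bu, bn = np.zeros(n_users), np.zeros(n_notes)
    fu = rng.normal(0, 0.1, (n_users, dim))
    fn = rng.normal(0, 0.1, (n_notes, dim))
    for _ in range(epochs):
        for u, n, r in ratings:  # r = 1 (helpful) or 0 (not helpful)
            err = r - (mu + bu[u] + bn[n] + fu[u] @ fn[n])
            mu += lr * err
            bu[u] += lr * (err - reg * bu[u])
            bn[n] += lr * (err - reg * bn[n])
            fu[u], fn[n] = (fu[u] + lr * (err * fn[n] - reg * fu[u]),
                            fn[n] + lr * (err * fu[u] - reg * fn[n]))
    return bn  # notes with a high intercept get shown

# A note rated helpful only by one side of the factor axis has those ratings
# explained by fu . fn rather than bn, so it ranks below a note that gets
# helpful ratings from across the spectrum.
```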

I'm surprised that it's one-dimensional, as that should make it relatively easy to game. If the attacker cares about promoting Israeli interests or Chinese interests, they can just cast a lot of votes along the right/left dimension on topics they don't care about.

Did they write anywhere why they only consider one dimension?


I don't think this is the case, but I'm mentioning this possibility because I'm surprised I've never seen someone suggest it before:

Maybe the reason Sam Altman is taking decisions that increase p(doom) is because he's a pure negative utilitarian (and he doesn't know-about/believe-in acausal trade).

Seems relevant - RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval:

'Our theoretical analysis reveals that CoT improves RNNs but is insufficient to close the gap with Transformers. A key bottleneck lies in the inability of RNNs to perfectly retrieve information from the context, even with CoT: for several tasks that explicitly or implicitly require this capability, such as associative recall and determining if a graph is a tree, we prove that RNNs are not expressive enough to solve the tasks while Transformers can solve them with e... (read more)
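For a concrete sense of the task they mean, here's a toy associative-recall instance (construction mine, not from the paper):

```python
import random

# The model sees key-value pairs, then a repeated key, and must emit the
# paired value. A fixed-size recurrent state has to compress every pair just
# in case, while attention can simply look the key up when it reappears.
def make_example(n_pairs=8, seed=0):
    rng = random.Random(seed)
    keys = rng.sample("abcdefghijklmnop", n_pairs)
    values = [rng.choice("0123456789") for _ in keys]
    query = rng.choice(keys)
    answer = values[keys.index(query)]
    sequence = [tok for kv in zip(keys, values) for tok in kv] + [query]
    return sequence, answer

seq, ans = make_example()
print(" ".join(seq), "->", ans)
```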

Vladimir_Nesov
StripedHyena, Griffin, and especially Based suggest that combining RNN-like layers with even tiny sliding-window attention might be a robust way of getting a large context, where the RNN-like layers don't have to be as good as Mamba for the combination to work. There is a great variety of RNN-like blocks that haven't been evaluated for hybridization with sliding-window attention specifically, as in Griffin and Based. Some of them might turn out better than Mamba on scaling laws after hybridization, so Mamba being impressive without hybridization might be less important than this general point.

(Possibly a window of precise attention gives the RNN layers many attempts at both storing and retrieving any given observation, so interspersing layers with even a relatively tiny window is sufficient to significantly improve on the sloppier RNN-like model without any attention, whereas a pure RNN-like model would have to capture what it needs from a token in the exact step it appears, after which the opportunity is mostly lost. StripedHyena's attention wasn't sliding-window, so context didn't scale any better, but with sliding-window attention there are no context-scaling implications from the attention layers.)
ryan_greenblatt
Another key note about Mamba is that, despite being RNN-like, it doesn't result in substantially higher effective serial reasoning depth. This is because the state transition is linear[1]. However, it is architecturally closer to things that might involve effectively higher depth. See also here.

----------------------------------------

1. And indeed, there is a fundamental tradeoff: if the state transition function is expressive (e.g. nonlinear), then it's no longer possible to use a parallel scan, because the intermediates for the scan would be too large to represent compactly, or wouldn't simplify the original functions. You can't compactly represent f∘g (f composed with g) in a way that makes computing f(g(x)) more efficient for general choices of f and g (in the typical MLP case at least). Another simpler but less illuminating way to put this is that higher serial reasoning depth can't be parallelized (without imposing some constraints on the serial reasoning).
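To spell out why linearity is what makes the parallel scan work (a standard textbook sketch, not Mamba's actual kernel):

```python
import numpy as np

# An affine state update h -> A h + b is represented compactly as (A, b), and
# two steps compose into another affine map of the same size:
#   (f . g)(h) = A_f (A_g h + b_g) + b_f = (A_f A_g) h + (A_f b_g + b_f)
# Because this combine is associative, prefix states can be computed by a
# parallel tree reduction in O(log L) depth instead of a length-L chain.
def combine(f, g):  # g is applied first, then f
    A_f, b_f = f
    A_g, b_g = g
    return (A_f @ A_g, A_f @ b_g + b_f)

A1, b1 = 0.5 * np.eye(2), np.ones(2)     # step 1
A2, b2 = 0.9 * np.eye(2), np.zeros(2)    # step 2
A12, b12 = combine((A2, b2), (A1, b1))   # two steps fused into one affine map
h0 = np.array([1.0, -1.0])
assert np.allclose(A12 @ h0 + b12, A2 @ (A1 @ h0 + b1) + b2)

# With a nonlinear update like h -> relu(A h + b), the composition of two
# steps has no equally compact closed form, which is the tradeoff above:
# expressive transitions buy serial depth at the cost of parallelizability.
```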
LawrenceC
I mean, yeah, as your footnote says: Transformers do get more computation per token on longer sequences, but they also don't get more serial depth, so I'm not sure if this is actually an issue in practice?

1. ^ As an aside, I actually can't think of any class of interesting functions with this property -- when reading the paper, the closest I could think of are functions on discrete sets (lol), polynomials (but simplifying these is often more expensive than just computing the terms serially), and rational functions (ditto)

(Crossposted by habryka after asking Eliezer whether I could post it under his account)

i.

"Ignore all these elaborate, abstract, theoretical predictions," the Spokesperson for Ponzi Pyramid Incorporated said in a firm, reassuring tone.  "Empirically, everyone who's invested in Bernie Bankman has received back 144% of what they invested two years later."

"That's not how 'empiricism' works," said the Epistemologist.  "You're still making the assumption that --"

"You could only believe that something different would happen in the future, if you believed in elaborate theoretical analyses of Bernie Bankman's unobservable internal motives and internal finances," said the spokesperson for Ponzi Pyramid Incorporated.  "If you are a virtuous skeptic who doesn't trust in overcomplicated arguments, you'll believe that future investments will also pay back 144%, just like in the past.  That's the...

can't we just look at weights?

As I understand it, interpretability research isn't exactly stuck, but it's very, very far from something like this, even for non-SotA models. And the gap is growing.

Eliezer Yudkowsky
Unless I'm greatly misremembering, you did pick out what you said was your strongest item from Lethalities, separately from this, and I responded to it.  You'd just straightforwardly misunderstood my argument in that case, so it wasn't a long response, but I responded.  Asking for a second try is one thing, but I don't think it's cool to act like you never picked out any one item or I never responded to it. EDIT: I'm misremembering, it was Quintin's strongest point about the Bankless podcast.  https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky?commentId=cr54ivfjndn6dxraD
Eliezer Yudkowsky
If Quintin hasn't yelled "Empiricism!" then it's not about him.  This is more about (some) e/accs.
Quintin Pope
FWIW, I don't think this is the main issue with the evolution analogy. The main issue is that evolution faced a series of basically insurmountable, yet evolution-specific, challenges in successfully generalizing human 'value alignment' to the modern environment: the fact that optimization over the genome can only influence within-lifetime value formation through insanely unstable, Rube Goldberg-esque mechanisms that rely on steps like "successfully zero-shot directing an organism's online learning processes through novel environments via reward shaping"; the fact that accumulated lifetime value learning is mostly reset with each successive generation, without massive fixed corpuses of human text / RLHF supervisors to act as an anchor against value drift; and evolution having a massive optimization power overhang in the inner loop of its optimization process.

These issues fully explain away the 'misalignment' humans have with IGF and other intergenerational value instability. If we imagine a deep learning optimization process with an equivalent structure to evolution, then we could easily predict similar stability issues would arise due to that unstable structure, without having to posit an additional "general tendency for inner misalignment" in arbitrary optimization processes, which is the conclusion that Yudkowsky and others typically invoke evolution to support.

In other words, the issues with evolution as an analogy have little to do with the goals we might ascribe to DL/evolutionary optimization processes, and everything to do with simple mechanistic differences in structure between those processes.
FWIW, I don't think this is the main issue with the evolution analogy. The main issue is that evolution faced a series of basically insurmountable, yet evolution-specific, challenges in successfully generalizing human 'value alignment' to the modern environment, such as the fact that optimization over the genome can only influence within lifetime value formation theough insanely unstable Rube Goldberg-esque mechanisms that rely on steps like "successfully zero-shot directing an organism's online learning processes through novel environments via reward shaping", or the fact that accumulated lifetime value learning is mostly reset with each successive generation without massive fixed corpuses of human text / RLHF supervisors to act as an anchor against value drift, or evolution having a massive optimization power overhang in the inner loop of its optimization process.    These issues fully explain away the 'misalignment' humans have with IGF and other intergenerational value instability. If we imagine a deep learning optimization process with an equivalent structure to evolution, then we could easily predict similar stability issues would arise due to that unstable structure, without having to posit an additional "general tendency for inner misalignment" in arbitrary optimization processes, which is the conclusion that Yudkowsky and others typically invoke evolution to support.    In other words, the issues with evolution as an analogy have little to do with the goals we might ascribe to DL/evolutionary optimization processes, and everything to do with simple mechanistic differences in structure between those processes.