# All of Lucius Bushnaq's Comments + Replies

Thanks, I did not know this. A quick search for his images seems to show that they use colour and perspective right at least as well as this does. Provided this is fully real and there's nobody else in his process choosing colors and such. Tentatively marking this down as a win for natural abstraction.

I'm honestly stunned by this. If it was indeed trained solely on text, how does it end up with such a good idea of how Euclidean space works? That's either stupidly impressive, or a possible hint that the set of natural abstractions is even smaller and a bigger attractor in algorithm space than I thought. The labyrinth seems explicable, but the graphics?

Could a born blind human do this?

3 Bezzi 1d
With enough training, sure. There are such things as born blind human painters [https://en.wikipedia.org/wiki/E%C5%9Fref_Arma%C4%9Fan].
2 DragonGod 1d
There's a fuckton of descriptions of images in text I guess. And it's consumed trillions of tokens.
1 JDockson 1d
It's not just blind. It essentially has no senses whatsoever. It seems to have extrapolated "sense" from text data.

But have you ever, even once in your life, thought anything remotely like "I really like being able to predict the near-future content of my visual field. I should just sit in a dark room to maximize my visual cortex's predictive accuracy."?

Possibly yes. I could easily see this underlying human preferences for regular patterns in art. Predictable enough to get a high score, not so predictable that whatever secondary boredom mechanism that keeps baby humans from maximising score by staring straight at the ceiling all day kicks in. I'm even getting suspiciou...

That separation between internal preferences and external behaviour is already implicit in Dutch books. Decision theory is about external behaviour, not internal representations. It talks about what agents do, not how agents work inside. To decision theorists, a preference is about something the system does or does not do in a given situation. When they talk about someone preferring pizza without pineapple, it's about that person paying money to not have pineapple on their pizza in some range of situations, not some definition related to computations about pineapples and pizzas in that person's brain.

I'd guess that the same structural properties that would make a network start out in the scarce channel regime by default would also make unintended channels rare. If the internal structure is such that very little information gets passed on unless you invest optimisation to make it otherwise, that same property should mean free rides are not common.

More central point, I'm a bit doubtful that this potential correspondence is all that important for understanding information transfer inside neural networks. Extant (A)GIs seem to have very few interface point...

To be frank, I have no idea what this is supposed to mean. If “make non-magical, humanlike systems” were actionable[1], there would not be much of an alignment problem. If this post is supposed to indicate that you think you have an idea for how to do this, but it's a secret, fine. But what is written here, by itself, sounds like a wish to me, not like a research agenda.

1. ^ Outside of getting pregnant, I suppose.

While funny, I think that tweet is perhaps a bit too plausible, and may be mistaken as having been aimed at statistical learning theorists for real, if a reader isn't familiar with its original context. Maybe flag that somehow?

1 Jesse Hoogland 1mo
Thanks for pointing this out!
2 DragonGod 1mo
+1. The joke is not immediately apparent out of context.

I don't typically imagine gradient hacking to be about mesa optimisers protecting themselves from erasure. Mesa optimisers are good at things. If you want to score well on a hard loss function involving diverse tricky problems, a mesa optimiser is often a great way of doing that. I do not think they would typically need to protect their optimisation circuitry from gradient descent.

Two prototypical examples of gradient hacking as I imagine it in my head are:

1. I have a terminal goal that doesn’t care about things related to scoring well on the loss function. B
...

Say our points  are the times of day measured by a clock.  And  are the temperatures measured by a thermometer at those times. We’re putting in times  in the early morning, where I decree temperature to increase roughly linearly as the sun rises.

You write the overparametrized regression model as  . Since our model doesn’t get to see the index, only the value of  itself, that has to implicitly be something like

Where ...

The other risk that could motivate not making this bet is the risk that the market – for some unspecified reason – never has a chance to correct, because (1) transformative AI ends up unaligned and (2) humanity’s conversion into paperclips occurs overnight. This would prevent the market from ever “waking up”.

You don't even need to expect it to occur overnight. It's enough for the market update to predictably occur so late that having lots of money available at that point is no longer useful. If AGI ends the world next week, there's not that ...

Interested, but depends on the cost. If I'm the only one who wants it, I'd be willing to pay \$30 to get the whole series, but probably not more. I don't know how long transcriptions usually take, but I'm guessing it'd certainly be >1h. So there'd need to be additional interest to make it worth it.

Epistemic status: sleep deprived musings

If I understand this right, this is starting to sound very testable.

Feed a neural network inputs consisting of variables . Configurations in a 2D Ising model, cat pictures, or anything else we humans think we know the latent variables for.

Train neural networks to output a set of variables  over the inputs. The loss function scores based on how much the output induces conditional independence of inputs over the training data set.

E.g., take the  divergence between...

While our current understanding of physics is predictably-wrong, it has no particular reason to be wrong in a way that is convenient for us.[1]

Meanwhile, more refined versions of some of the methods described here seem perhaps doable in principle, with sufficient technology.

You can make difficult things happen by trying hard at them. You can't violate the laws of physics by trying harder.

1. ^ Out of the many things that might be wrong about the current picture, impossibility of time travel is also one of the things I'd least expect to get overturne

...

This paper offers a fairly intuitive explanation for why flatter minima generalize better: suppose the training and testing data have distinct, but nearby, minima that minimize their respective loss. Then, the curvature around the training minima acts as the second order term in a Taylor expansion that approximates the expected test loss for models nearby the training minima.

I feel like this explanation is just restating the question. Why are the minima of the test and training data often close to each other? What makes reality be that way?

You can come up with some explanation involving mumble mumble fine-tuning, but I feel like that just leaves us where we started.

4 Quintin Pope 5mo
My intuition: small changes to most parameters don’t influence behavior that much, especially if you’re in a flat basin. The local region in parameter space thus contains many possible small variations in model behavior. The behavior that solves the training data is similar to that which solves the test data, due to them being drawn from the same distribution. It’s thus likely that a nearby region in parameter space is a minimum for the test data.

Our team has copied Lightcone's approach to communicating over Discord, and I've been very happy with it.

Aren't Standard Parametrisation and other parametrisations with a kernel limit commonly used mostly in cases where you're far away from reaching the depth-to-width≈0 limit, so expansions like the one derived for the NTK parametrisation aren't very predictive anymore, unless you calculate infeasibly many terms in the expensive perturbative series?

As far as I'm aware, when you're training really big models where the limit behaviour matters, you use parametrisations that don't get you too close to a kernel limit in the regime you're dealing with. Am I mistake...

4 Sho Yaida 5mo
Thank you for the discussion! Let us start by stressing that, of course, the maximal-update parametrization is definitely an intriguing recent development, and it would be very interesting to find tools to be able to understand the strongly-coupled regime in which it resides.

Now, it seems like there are two different issues tangled in this discussion: (i) is one parameterization "better" than another in practice?; and (ii) is our effective theory analysis useful in practically interesting regimes?

1. The first item is perhaps more an empirical question, whose answer will likely emerge in coming years. But, even if maximal-update parametrization turns out to be universally better for every task, its strongly-coupled nature makes it very difficult to analyze, which perhaps makes it more problematic from a safety/interpretability perspective.
2. For the second item, we hope we will address concerns in the details of our reply below.

We'd like to also emphasize that, even if you are against NTK parameterization in practice and don't think it's relevant at all -- a position we don't hold, but maybe one might -- perhaps it's still worth pointing out that our work provides a simple solvable model of representation learning from which we might learn some general principles that may be applicable to safety and interpretability.

With those said, let us respond to your comments point by point.

We aren't sure if that's accurate: empirically, as nicely described in Jennifer's 8-page summary (in Sec. 1.5), many practical networks -- from a simple MLP to the not-very-simple GPT-3 -- seem to perform well in a regime where the depth-to-width aspect ratio is small (like 0.01 or at most 0.1). So, the leading-order perturbative description would be fairly accurate for describing these practically-useful networks. Moreover, one of the takeaways from "effective theory" descriptions is that we understand the truncation error: in particular, the errors from t

The book's results hold for a specific kind of neural network training parameterisation, the "NTK parametrisation", which has been argued (convincingly, to me) to be rather suboptimal. With different parametrisation schemes, neural networks learn features even in the infinite width limit.

You can show that neural network parametrisations can essentially be classified into those that will learn features in the infinite width limit, and those that will converge to some trivial kernel. One can then derive a "maximal update parametrisation", in which infi...

5 danroberts 6mo
Thank you for the comment! Let me reply to your specific points.

First and TL;DR, whether NTK parameterization is "right" or "wrong" is perhaps an issue of prescriptivism vs. descriptivism: regardless of which one is "better", the NTK parameterization is (close to what is) commonly used in practice, and so if you're interested in modeling what practitioners do, it's a very useful setting to study. Additionally, one disadvantage of maximal update parameterization from the point of view of interpretability is that it's in the strong-coupling regime, and many of the nice tools we use in our book, e.g., to write down the solution at the end of training, cannot be applied. So perhaps if your interest is safety, you'd be shooting yourself in the foot if you use maximal update parameterization! :)

Second, it is a common misconception that the NTK parameterization cannot learn features and that maximal update parameterization is the only parameterization that learns features. As discussed in the post above, all networks in practice have finite width; the infinite-width limit is a formal idealization. At finite width, either parameterization learns features. Moreover, in the formal infinite-width limit, it is true that *infinite-width with fixed depth* doesn't learn features, but you can also take a limit that scales up both depth and width together where NTK parameterization learns features. Indeed, one of the main results of the book is to say that, for NTK parameterization, the depth-to-width aspect ratio is the key hyperparameter that controls the theory describing how realistic networks behave.

Third, the scaling up of hyperparameters is an aspect that follows from the understanding of either parameterization, NTK or maximal update; a benefit of this kind of theory, from the practical perspective, is certainly learning how to correctly scale up to larger models.

Fourth, I agree that maximal update parameterization is also interesting to study, espec

I remain confused about why this is supposed to be a core difficulty for building AI, or for aligning it.

You've shown that if one proceeds naively, there is no way to make an agent that'd model the world perfectly, because it would need to model itself.

But real agents can't model the world perfectly anyway. They have limited compute and need to rely on clever abstractions that model the environment well in most situations while not costing too much compute. That (presumably) includes abstractions about the agent itself.

It seems to me that that's how humans...

1 ojorgensen 6mo
Firstly, thanks for reading the post! I think you're referring mainly to realisability here which I'm not that clued up on tbh, but I'll give you my two cents because why not.  I'm not sure to what extent we should focus on unrealisability when aligning systems. I think I have a similar intuition to you that the important question is probably "how can we get good abstractions of the world, given that we cannot perfectly model it". However, I think better arguments for why unrealisability is a core problem in alignment than I have laid out probably do exist, I just haven't read that much into it yet. I'll link again to this video series on IB [https://www.lesswrong.com/posts/mSDwPeqAzYk79vLiA/understanding-infra-bayesianism-a-beginner-friendly-video] (which I'm yet to finish) as I think there are probably some good arguments here.

I haven't read that deeply into this yet, but my first reaction is that I don't see what this gains you compared to a perspective in which the functions mapping the inputs of the network to the activations of the layers are regarded as the network's elementary units.

Unless I'm misunderstanding something, when you look at the entire network , where  is the input, each polytope of f(x) with its affine transformation corresponds to one of the linear segments of . Same with looking at, say, the polytopes mapping layer  t...

3 Lee Sharkey 6mo
Thanks for your comment! Extending the polytope lens to activation functions such as sigmoids, softmax, or GELU is the subject of a paper by Balestriero & Baraniuk (2018) [https://arxiv.org/abs/1810.09274]. In the case of GELU and some similar activation functions, you'd need to replace the binary spline-code vectors with vectors whose elements take values in (0, 1). There's some further explanation in Appendix C! This, indeed, is the assumption we wish to relax. Agreed! There are many lenses that let us see how unsurprising this experiment was, and this is another one! We only use this experiment to show that it's surprising when you view features as directions and don't qualify that view by invoking a distribution of activation magnitude where semantics is still valid (called a 'distribution of validity' in this post).

Sai, who is a lot more topology-savvy than me, now suspects that there is indeed a connection between this norm approach and the topology of the intermediate set. We'll look into this.

Ah, right, you did mention polar coordinates.

Hm, stretching seems handleable. How about also using the weight matrix, for example? Change into the eigenbasis above, then apply stretching to make all L2 norms size 1 or size 0. Then look at the weights, as stretching-and-rotation invariant quantifiers of connectedness?

Maybe doesn't make much sense when considering non-linear transformations though.

2 johnswentworth 6mo
I think that's the same as finding a low-rank decomposition, assuming I correctly understand what you're saying?

Curious how looking at properties of the functions the  embed through their activation patterns fits into this picture.

For example, take the L2 norms of the activations of all entries of , averaged over some set of network inputs. The sum and product of those norms will both be coordinate independent.

In fact, we can go one step further, and form , the matrix of the L2 inner products of all the layer base elements with each other. The eigendecomposition of this matrix is also coordinate independent, up to dege...

6 johnswentworth 6mo
That would be true if the only coordinate changes we consider are rotations. But the post is talking about much more general transformations than that - we're allowing not only general linear transformations (i.e. stretching in addition to rotations), but also nonlinear transformations (which is why RELUs don't give a preferred coordinate system).
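The point of this exchange can be checked numerically on a tiny case. The sketch below is an illustration under my own assumptions, not code from the post: a "layer" with two units, four made-up activation pairs, and hypothetical helper names (`gram`, `eigenvalues`, `transform`). It builds the matrix of averaged inner products between the units and compares its eigenvalues before and after a rotation and a stretch of the layer basis.

```python
import math


def gram(acts):
    """Entries (m11, m12, m22) of the 2x2 matrix of inner products between
    two units' activations, averaged over inputs. acts: list of (h1, h2)."""
    n = len(acts)
    m11 = sum(h1 * h1 for h1, _ in acts) / n
    m12 = sum(h1 * h2 for h1, h2 in acts) / n
    m22 = sum(h2 * h2 for _, h2 in acts) / n
    return m11, m12, m22


def eigenvalues(m11, m12, m22):
    """Eigenvalues of the symmetric matrix [[m11, m12], [m12, m22]]."""
    mean = (m11 + m22) / 2.0
    radius = math.hypot((m11 - m22) / 2.0, m12)
    return mean - radius, mean + radius


def transform(acts, matrix):
    """Apply a 2x2 linear change of basis to every activation pair."""
    (p, q), (r, s) = matrix
    return [(p * h1 + q * h2, r * h1 + s * h2) for h1, h2 in acts]


# Made-up activations of the two units on four inputs.
acts = [(1.0, 0.5), (0.2, -1.3), (-0.7, 0.9), (2.0, 0.1)]

theta = 0.6
rotation = ((math.cos(theta), -math.sin(theta)),
            (math.sin(theta), math.cos(theta)))
stretch = ((3.0, 0.0), (0.0, 1.0))

base = eigenvalues(*gram(acts))
rotated = eigenvalues(*gram(transform(acts, rotation)))
stretched = eigenvalues(*gram(transform(acts, stretch)))

print("original: ", base)
print("rotated:  ", rotated)    # same eigenvalues up to rounding
print("stretched:", stretched)  # different eigenvalues
```

Only the eigenvalues are compared here; the eigenvectors rotate along with the basis. Nonlinear reparametrisations, the more general case raised in the reply, are not covered by this toy.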

I like the "cut" framing, and I'm happy someone else is having a go at these sorts of questions from a somewhat different angle.

Let's say we want to express the following program:

    def program(a, b, c):
        if a:
            return b + c
        else:
            return b - c

I'm not sure I understand the problem. Neural networks can implement operations equivalent to an if. They're going to be somewhat complicated, but that's to be expected. An if just isn't an elementary operation of arithmetic. It takes some non-linearities to build up.
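As a concrete illustration of building an if out of non-linearities, here is a minimal sketch of the program above written with only affine maps and ReLUs. The construction, the name `relu_program`, and the `bound` parameter are my own assumptions for illustration: it requires `a` to be exactly 0 or 1 and `|c| <= bound`.

```python
def relu(z):
    return max(0.0, z)


def program(a, b, c):
    # Reference program from the text.
    if a:
        return b + c
    else:
        return b - c


def relu_program(a, b, c, bound=100.0):
    """Same input-output behaviour using affine maps and two ReLU units.

    Assumes a is 0 or 1 and abs(c) <= bound.
    """
    gate = bound * (1.0 - a)   # 0 when a == 1, `bound` when a == 0
    h_pos = relu(c - gate)     # relu(c) when a == 1, otherwise 0
    h_neg = relu(-c - gate)    # relu(-c) when a == 1, otherwise 0
    gated_c = h_pos - h_neg    # c when a == 1, otherwise 0
    return b - c + 2.0 * gated_c


for a in (0, 1):
    for b, c in [(1.0, 2.0), (-3.5, 0.25), (0.0, -7.0)]:
        print(a, b, c, program(a, b, c), relu_program(a, b, c))
```

The branch costs two hidden units and a bound on the inputs, which matches the remark that an if is expressible but not elementary.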

Layer Activation Space is a gene

...

I'm not sure if the number of near zero eigenvalues is the right thing to look at.

If the training process is walking around the parameter space until it "stumbles on" a basin, what's relevant for which basin is found isn't just the size of the basin floor, it's also how big the basin walls are. Analogy: A very narrow cylindrical hole in a flat floor may be harder to fall into than a very wide, sloped hole. Even though the bottom of the latter may be just a single point.

I've typically operated under the assumption that something like "basin volum...

Is the idea with the cosine similarity to check whether similar prompt topics consistently end up yielding similar vectors in the embedding space across all the layers, and different topics end up in different parts of embedding space?

Because individual transformer layers are assumed to only act on specific sub-spaces of the embedding space, and write their results back into the residual stream, so if you can show that different topics end up in different sub-spaces of the stream, you effectively show that different attention heads and MLPs must be d...

6 NickyP 7mo
Yeah, I would say this is the main idea I was trying to get towards. I think I will probably just look at the activations instead of the output + residual in further analysis, since it wasn't particularly clear in the outputs of the fully-connected layer, or at least find a better metric than Cosine Similarity. Cosine Similarity probably won't be too useful for analysis that is much deeper, but I think it was sort of useful for showing some trends.

I have also tried using a "scaled cosine similarity" metric, which shows essentially the same output, though preserves the relative length. (That is, instead of normalising each vector to 1, I rescaled each vector by the length of the largest vector, such that now the largest vector has length 1 and every other vector is smaller or equal in size). With this metric, I think the graphs were slightly better, but the cosine similarity plots between different vectors had the behaviour of all vectors being more similar with the longest vector which I thought made it more difficult to see the similarity on the graphs for small vectors, and felt like it would be more confusing to add some weird new metric. (Though now writing this, it now seems an obvious mistake that I should have just written the post with "scaled cosine similarity", or possibly some better metric if I could find one, since it seems important here that two basically zero vectors should have a very high similarity, and this isn't captured by either of these metrics). I might edit the post to add some extra graphs in an edited appendix, though this might also go into a separate post.

As for looking at the attention heads instead of the attention blocks, so far I haven't seen that they are a particularly better unit for distinguishing between the different categories of text (though for this analysis so far I only looked at OPT-125M). When looking at outputs of the attention heads, and their cosine similarities, usually it seemed that the main difference was from a
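The two metrics discussed in this reply can be compared on a small made-up set of vectors. This is a sketch of my reading of the description (divide every vector by the largest norm in the set, then take dot products), not NickyP's actual code; the function names are hypothetical, and it assumes at least one vector is nonzero.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def cosine_similarity(u, v):
    """Standard cosine similarity; returns None when either vector is zero,
    since the quantity is undefined there."""
    nu, nv = norm(u), norm(v)
    if nu == 0.0 or nv == 0.0:
        return None
    return dot(u, v) / (nu * nv)


def scaled_cosine_similarities(vectors):
    """Rescale all vectors by the largest norm in the set, then return the
    matrix of pairwise dot products. Assumes at least one nonzero vector."""
    longest = max(norm(v) for v in vectors)
    scaled = [[x / longest for x in v] for v in vectors]
    return [[dot(u, v) for v in scaled] for u in scaled]


# A long vector, a short vector pointing the same way, and a zero vector.
vectors = [[3.0, 4.0], [0.3, 0.4], [0.0, 0.0]]
sims = scaled_cosine_similarities(vectors)

print("cosine(long, short):", cosine_similarity(vectors[0], vectors[1]))
print("scaled(long, short):", sims[0][1])
print("scaled(short, short):", sims[1][1])
print("cosine(zero, zero):", cosine_similarity(vectors[2], vectors[2]))
print("scaled(zero, zero):", sims[2][2])
```

On this example the short vector scores much lower against itself than the long one does, and two zero vectors get either an undefined or a zero score, which is the limitation the reply points out for both metrics.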

Well for starters, it narrows down the kind of type signature you might need to look for to find something like a "desire" inside an AI, if the training dynamics described here are broad enough to hold for the AI too.

It also helped me become less confused about what the "human values" we want the AI to be aligned with might actually mechanistically look like in our own brains, which seems useful for e.g. schemes where you try to rewire the AI to have a goal given by a pointer to its model of human values. I imagine having a better idea of what you're actually aiming for might also be useful for many other alignment schemes.

Not confused, just optimised to handle data of the kind seen in training, and with limited ability to generalise beyond that, compared to human vision.

1 Tom Lieberum 7mo
Yeah I agree with that. But there is also a sense in which some (many?) features will be inherently sparse.

* A token is either the first one of a multi-token word or it isn't.
* A word is either a noun, a verb or something else.
* A word belongs to language LANG and not to any other language/has other meanings in those languages.
* A H×W image can only contain so many objects which can only contain so many sub-aspects.

I don't know what it would mean to go "out of distribution" in any of these cases. This means that any network that has an incentive to conserve parameter usage (however we want to define that), might want to use superposition.

I imagine you are basically going down the "features as elementary unit" route proposed in Circuits (although you might not be pre-disposed to assume features are the elementary unit). Finding the set of features used by the network and figuring out how it's using them in its computations does not 1-to-1 translate to "find the basis the network is thinking in" in my mind.

Fair enough, imprecise use of language. For some definitions of "thinking" I'd guess a small vision CNN isn't thinking anything.

Also it seems reasonable to me that ≈all of reality is extremely sparse in features, which presumably favors superposition.

Reality is usually sparse in features, and that's why even very small and simple intelligences can operate within it most of the time, so long as they don’t leave their narrow contexts. But the mark of a general intelligence is that it can operate even in highly out-of-distribution situations. Cars are usually driven on roads, so an intelligence could get by using a car even if its concepts of car-ness were all mixed up with its con...

1 TAG 7mo
Reality is rich in features, but sparse in features that matter to a simple organism. That's why context matters.

Ah, I see. Thank you for pointing this out. Do superposition features actually seem to work like this in practice in current networks? I was not aware of this.

In any case, for a network like the one you describe I would change my claim from

it'd mean that to the AI, dog heads and car fronts are "the same thing".

to the AI having a concept for something humans don't have a neat short description for. So for example, if your algorithm maps X>0 Y>0 to the first case, I'd call it a feature of "presence of dog heads or car fronts, or presence of car f...

1 Tom Lieberum 7mo
I'm not aware of any work that identifies superposition in exactly this way in NNs of practical use.  As Spencer notes, you can verify that it does appear in certain toy settings though. Anthropic notes in their SoLU paper [https://transformer-circuits.pub/2022/solu/index.html] that they view their results as evidence for the SPH in LLMs. Imo the key part of the evidence here is that using a SoLU destroys performance but adding another LayerNorm afterwards solves that issue. The SoLU selects strongly against superposition and LayerNorm makes it possible again, which is some evidence that the way the LLM got to its performance was via superposition.   ETA: Ofc there could be some other mediating factor, too.

I don't think that's true. Imagine a toy scenario of two features that run through a 1D non-linear bottleneck before being reconstructed. Assuming that with some weight settings you can get superposition, the model is able to reconstruct the features ≈perfectly as long as they don't appear together. That means the model can still differentiate the two features, they are different in the model's ontology.

I'm not sure I understand this example. If I have a single 1-D feature, a floating point number that goes up with the amount of dog-headedness or car-front...

2 Tom Lieberum 7mo
Possibly the source of our disagreement here is that you are imagining the neuron ought to be strictly monotonically increasing in activation relative to the dog-headedness of the image? If we abandon that assumption then it is relatively clear how to encode two numbers in 1D. Let's assume we observe two numbers X,Y. With probability p, X=0,Y∼N(0,1), and with probability (1−p), Y=0,X∼N(0,1).  We now want to encode these two events in some third variable Z, such that we can perfectly reconstruct X,Y with probability ≈1. I put the solution behind a spoiler for anyone wanting to try it on their own.
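For readers who want something concrete to test against, here is one possible encoding of the setup just described. It is an assumption of mine and not necessarily the solution behind the spoiler: the sign of Z records which of the two variables is active, and the magnitude records its value. The function names `sample`, `encode` and `decode` are hypothetical.

```python
import math
import random


def sample(p, rng):
    """With probability p return (X=0, Y~N(0,1)); otherwise (X~N(0,1), Y=0)."""
    value = rng.gauss(0.0, 1.0)
    return (0.0, value) if rng.random() < p else (value, 0.0)


def encode(x, y):
    """Pack the pair into a single real number Z."""
    if y == 0.0:
        return math.exp(x)     # positive branch carries X
    return -math.exp(y)        # negative branch carries Y


def decode(z):
    """Recover (X, Y) from Z."""
    if z > 0.0:
        return math.log(z), 0.0
    return 0.0, math.log(-z)


rng = random.Random(0)
worst_error = 0.0
for _ in range(1000):
    x, y = sample(0.3, rng)
    x_hat, y_hat = decode(encode(x, y))
    worst_error = max(worst_error, abs(x - x_hat), abs(y - y_hat))

print("largest reconstruction error:", worst_error)
```

The decoder here is a piecewise function with a logarithm, so this shows only that a 1D code exists, not that a particular network architecture would learn it.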

Sure, but that's not a question I'm primarily interested in. I don't want the most interpretable basis, I want the basis that the network itself uses for thinking. My goal is to find the elementary unit of neural networks, to build theorems and eventually a whole predictive theory of neural network computation and selection on top of.

That this may possibly make current networks more human-interpretable even in the short run is just a neat side benefit to me.

1 Tom Lieberum 7mo
Ah, I might have misunderstood your original point then, sorry! I'm not sure what you mean by "basis" then. How strictly are you using this term? I imagine you are basically going down the "features as elementary unit" route proposed in Circuits (although you might not be pre-disposed to assume features are the elementary unit). Finding the set of features used by the network and figuring out how it's using them in its computations does not 1-to-1 translate to "find the basis the network is thinking in" in my mind.

I'm sorry but the fact that it is scalar output isn't explained and a network with a single neuron in the final layer is not the norm.

Fair enough, should probably add a footnote.

More importantly, I am trying to explain that I think the math does not stay the same in the case where the network output is a vector (which is the usual situation in deep learning) and the loss is some unspecified function. If the network has vector output, then right after where you say "The Hessian matrix for this network would be...", you don't get a factorization like that; y

...
3 Spencer Becker-Kahn 7mo
You're right about the loss thing; it isn't as important as I first thought it might be.

Your way of doing it basically approximates the network to first order in the parameter changes/second order in the loss function. That's the same as the method I'm proposing above really, except you're changing the features to account for the chain rule acting on the layers in front of them. You're effectively transforming the network into an equivalent one that has a single linear layer, with the entries of  as the features.

That's fine to do when you're near a global optimum, the case discussed in the main body of this post, and for tin...

What do you think about the Superposition Hypothesis? If that were true, then at a sufficient sparsity of features in the input there is no basis in which the network is thinking in, meaning it will be impossible to find a rotation matrix that allows for a bijective mapping between neurons and features.

I'd say that there is a basis the network is thinking in in this hypothetical, it would just so happen to not match the human abstraction set for thinking about the problem in question.

If due to superposition, it proves advantageous to the AI to have a sing...

1 Tom Lieberum 7mo
Well, yes but the number of basis elements that make that basis human interpretable could theoretically be exponential in the number of neurons.
2 Tom Lieberum 7mo
I don't think that's true. Imagine a toy scenario of two features that run through a 1D non-linear bottleneck before being reconstructed. Assuming that with some weight settings you can get superposition, the model is able to reconstruct the features ≈perfectly as long as they don't appear together. That means the model can still differentiate the two features, they are different in the model's ontology. My intuition disagrees here too. Whether we will observe superposition is a function of (number of "useful" features in the data), (sparsity of said features), and something like (bottleneck size). It's possible that bottleneck size will never be enough to compensate for number of features. Also it seems reasonable to me that ≈all of reality is extremely sparse in features, which presumably favors superposition.

So the eigenvector  doesn't give you the features directly in imagespace, it gives you the network parameters which "measure" the feature?

Nope, you can straightforwardly read off the feature in imagespace, I think. Remember, the eigenvector doesn't just show you which parameters "form" the feature through linear combination, it also shows you exactly what that linear combination is. If your eigenvector is (2,0,-3), that means the feature in image space looks like taking twice the activations of the node connected to , plus -3 times th...

2 tailcalled 7mo
Hmm, I suppose in the single-linear-layer case, your way of transferring it to imagespace is equivalent to mine, whereas in the multi-nonlinear-layer case, I am not sure which generalization is the most appropriate.

I think we're far off from being able to make any concrete claims about selection dynamics with this, let alone selection dynamics about things as complex and currently ill-operationalised as "goals".

I'd hope to be able to model complicated things like this once Selection Theory is more advanced, but right now this is just attempting to find angles to build up the bare basics.

In your main computation it seems like it's being treated as a scalar.

It's an example computation for a network with scalar outputs, yes. The math should stay the same for multi-dimensional outputs though. You should just get higher dimensional tensors instead of matrices.

Vivek wanted to suppose that  were equal to the identity matrix, or a multiple thereof, which is the case for mean squared loss.

In theory, a loss function that explicitly depends on network parameters would behave differently than is assumed in this derivation...

4 Spencer Becker-Kahn 7mo
I'm sorry but the fact that it is scalar output isn't explained and a network with a single neuron in the final layer is not the norm. More importantly, I am trying to explain that I think the math does not stay the same in the case where the network output is a vector (which is the usual situation in deep learning) and the loss is some unspecified function. If the network has vector output, then right after where you say "The Hessian matrix for this network would be...", you don't get a factorization like that; you can't pull out the Hessian of the loss as a scalar, it instead acts in the way that I have written - like a bilinear form for the multiplication between the rows and columns of Jf.   OK maybe I'll try to avoid a debate about exactly what 'feature' means or means to different people, but in the example, you are clearly using f(x)=Θ0+Θ1x1+Θ2cos(x1). This is a linear function of the Θ variables. (I said "Is the example not too contrived....in particular it is linear in Θ" - I'm not sure how we have misunderstood each other, perhaps you didn't realise I meant this example as opposed to the whole post in general). But what it means is that in the next line when you write down the derivative with respect to Θ, it is an unusually clean expression because it now doesn't depend on Θ. So again, in the crucial equation right after you say "The Hessian matrix for this network would be...", you in general get Θ variables appearing in the matrix. It is just not as clean as this expression suggests in general.
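For reference, the general form of the loss Hessian at issue in this exchange follows from the chain rule. This is the standard identity, not a formula quoted from the post or from either commenter; the index names $a, b$ for output components and $i, j$ for parameters are my own labels, with $L$ the loss and $f$ the network output.

```latex
\frac{\partial^2 L}{\partial \Theta_i \, \partial \Theta_j}
  = \sum_{a,b} \frac{\partial f_a}{\partial \Theta_i}
      \, \frac{\partial^2 L}{\partial f_a \, \partial f_b} \,
      \frac{\partial f_b}{\partial \Theta_j}
  + \sum_{a} \frac{\partial L}{\partial f_a}
      \, \frac{\partial^2 f_a}{\partial \Theta_i \, \partial \Theta_j}

% For the example f(x) = \Theta_0 + \Theta_1 x_1 + \Theta_2 \cos(x_1):
\nabla_\Theta f = \bigl(1,\; x_1,\; \cos(x_1)\bigr),
\qquad
\frac{\partial^2 f}{\partial \Theta_i \, \partial \Theta_j} = 0
```

With a scalar output the first sum reduces to a single number $\partial^2 L / \partial f^2$ times the outer product of $\nabla_\Theta f$ with itself, which is the factorised form. With a vector output the matrix $\partial^2 L / \partial f_a \partial f_b$ sits between the two Jacobian factors as a bilinear form. In the linear example the gradient contains no $\Theta$ and the second sum vanishes identically, which is why that expression comes out unusually clean.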

Interesting idea, and I'm generally very in favour of any efforts to find more understandable and meaningful "elementary units" of neural networks right now. I think this is currently the research question that most bottlenecks any efforts to get a deeper understanding of NN internals and NN selection, and I think those things are currently the biggest bottlenecks to any efforts at generating alignment strategies that might actually work. So we should be experimenting with lots of ideas for different NN "bases" to use and construct our theory of Deep Learn...

1Garrett Baker7mo
I agree entirely with this bottleneck analysis, and am also very excited about the work you're doing and have just posted.

I think the idea is that if the rotated basis fundamentally "means" something important, rather than just making what's happening easier to picture for us humans, we'd kind of expect the basis computed for X->Y to mostly match the basis for Y->Z.

At least that's the sort of thing I'd expect to see in such a world.

2Garrett Baker7mo
Yup, this is why I'm skeptical there will be a positive result. I did not try to derive a principled, meaningful basis. I tried the most obvious thing to do, which nobody else seems to have done. So I expect this device will be useful and potentially the start of something fundamental, but not fundamental itself.

You take the gradient with respect to any preactivation of the next layer. Shouldn't matter which one. That gets you a length n vector. Since the weights are linear, and we treat biases as an extra node of constant activation, the vector does not depend on which preactivation you chose.

The idea is to move to a basis in which there is no redundancy or cancellation between nodes, in a sense. Every node encodes one unique feature that means one unique thing.

Someone more versed in this line of research clue me in please: Conditional on us having developed the kind of deep understanding of neural networks and their training implicit in having "agentometers" and "operator recognition programs" and being able to point to specific representations of stuff in the AGIs' "world model" at all, why would we expect picking out the part of the model that corresponds to human preferences specifically to be hard and in need of precise mathematical treatment like this?

An agentometer is presumably a thing that finds st...

3Jeremy Gillen7mo
I see this proposal as reducing the level of deep understanding of neural networks that would be required to have an "agentometer". If we had a way of iterating over every "computation" in the world model, then in principle, we could use the definition of intelligence above to measure the intelligence of each computation, and filter out all the low intelligence ones. I think this covers most of the work required to identify the operator. Working out how to iterate over every computation in the world model is the difficult part. We could try iterating over subnetworks of the world model, but it's not clear this would work. Maybe iterate over pairs of regions in activation space? Of course these are not practical spaces to search over, but once we know the right type signature to search for, we can probably speed up the search by developing heuristic-guided methods. Yeah this is approximately how I think the "operator identification" would work. Yeah this is one of the fears. The point of the intelligence measuring equation for g is that it is supposed to work, even on galaxy-brained world model ontologies. It only works by measuring competence of a computation at achieving goals, not by looking at the structure of the computation for "agentyness". These can be computations that aren't very agenty, or don't match an agent at all, or only match part of an agent, so the part that spits out potential policies doesn't have to be very good. The g computation is used to find among these the ones that best match an agent.

I think I might just commit to staying away from LSD and Mind Illuminated-style meditation entirely. Judging by the frequency of word-of-mouth accounts like this, the chance of going a little or a lot insane while exposed to them seems frighteningly high.

I wonder why these long-term effects seem relatively sparsely documented. Maybe you have to take the meditation really seriously and practice diligently for this stuff to have a high chance of happening, and people in this community do that often, but the average study population doesn't?

-8Viliam7mo

Yeah, I think people who are high in abstract thinking, in believing their beliefs, and in anxious thought patterns should really stay away from psychedelics and from leaning too hard into their runaway thought trains. Also, try to stay grounded with people and activities that don't send you off into abstract thought space. Spend some time with calm normal people who look at the world in straightforward ways, not only creative wild thinkers. Spend time doing hobbies outdoors that use your physical body and attention in satisfying ways, keeping you engaged enough to stay out of your head.

There can also be factors in this community that make people both unusually likely to go insane and to also try things like meditation and LSD in an attempt to help themselves. It's a bit hard to say given that the post is so vague on what exactly "insanity" means, but the examples of acausal trade etc. make me suspect that it's related to a specific kind of anxiety which seems to be common in the community.

That same kind of anxiety also made me (temporarily) go very slightly crazy many years ago, when I learned about quantum mechanics (and I had nei...

9ChristianKl7mo
The MBSR studies are two-month interventions. They are not going to have the same strong effects as people meditating seriously for years. On the other hand, those studies that investigate people who meditate a lot are often from a monastic setting where people have teachers, which is quite different from someone meditating without a teacher and orienting themselves with the Mind Illuminated.

Even if they were somehow extremely beneficial normally (which is fairly unlikely), any significant risk of going insane seems much too high. I would posit they have such a risk for exactly the same reason: when using them, you are deliberately routing around very fundamental safety features of your mind.

Note: I think what you're doing there is asking what incremental change in the training data uniquely strengthens the influence of one feature in the network without touching the others.

The "pointiest directions" in parameter space correspond to the biggest features in the orthogonalised feature set of the network.

So I’d agree with the prediction that if you calculate what dtheta the dx corresponds to in the second network, you'd indeed often find that it's close to being an eigenvector/most prominent orthogonalised feature of the second networ...

There should be a post with some of it out soon-ish. Short summary:

You can show that at least for overparametrised neural networks, the eigenvalues of the Hessian of the loss function at optima, which determine the basin size within some approximation radius, are basically given by something like the number of independent, orthogonal features the network has, and how "big" these features are.

The fewer independent, mutually orthogonal features the network has, and the smaller they are, the broader the optimum will be. Size and orthogonality are given b...
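A toy illustration of the summary above, restricted to a model that is linear in its parameters with mean squared loss (an assumption made to keep the Hessian exact; the full argument concerns neural networks at optima). The feature choices and the helper names `hessian_mse_linear` and `count_flat_directions` are hypothetical.

```python
import numpy as np

def hessian_mse_linear(features):
    """Hessian of (1/2N) * sum_i (f(x_i) - y_i)^2 for f = features @ theta.

    For a model that is linear in theta this is exactly features^T features / N.
    """
    n_points = features.shape[0]
    return features.T @ features / n_points

def count_flat_directions(hessian, rel_tol=1e-9):
    """Number of Hessian eigenvalues that are (numerically) zero."""
    eigenvalues = np.linalg.eigvalsh(hessian)
    return int(np.sum(eigenvalues < rel_tol * eigenvalues.max()))

x = np.linspace(-2.0, 2.0, 50)

# Three linearly independent features: 1, x, cos(x).
independent = np.stack([np.ones_like(x), x, np.cos(x)], axis=1)

# Same features plus a redundant one (2x is a multiple of x).
redundant = np.stack([np.ones_like(x), x, np.cos(x), 2.0 * x], axis=1)

print(count_flat_directions(hessian_mse_linear(independent)))
print(count_flat_directions(hessian_mse_linear(redundant)))
```

In this sketch the redundant feature adds a parameter but no new independent direction, so the Hessian gains a zero eigenvalue, i.e. a flat direction along which the loss does not change.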

It seems to be taken for granted here that self-awareness=qualia. If something is self-aware and talking or thinking about how it has qualia, that sure is evidence of it having qualia, but I'm not sure the reverse direction holds. What about internal-state-tracking is necessary for creating the mysterious redness of red exactly, or the hurt-iness of pain?

I can see how pain as defined above the spoiler section doesn't necessarily lead to pain qualia, and in many simple architectures obviously doesn't, but I don't see how processing a summary of pain e...

1Thane Ruthenis7mo
Great questions! Well, as you note, the only time we notice these things is when we self-model, and they otherwise have no causal effect on reality; a mind that doesn't self-reflect is not affected by them. So... that can only mean they only exist when we self-reflect. Mm, the summary-interpretation mechanism? Imagine if instead of an eye, you had a binary input, and the brain was hard-wired to parse "0" from this input as a dog picture, and "1" as a cat picture. So you perceive 1, the signal travels to the brain, enters the pre-processing machinery, that machinery retrieves the cat picture, and shoves it into the visual input of your planner-part, claiming it's what the binary organ perceives. Similarly, the binary pain channel you're describing would retrieve some hard-coded idea of how "I'm in pain" is meant to feel, convert it into a format the planner can parse, put it into some specialized input channel, and the planner would make decisions based on that. This would, of course, not be the rich and varied and context-dependent sense of pain we have — it would be, well, binary, always feeling the same.

I believe the standard explanation is that overparametrized ML finds generalizing models because gradient descent with weight decay finds policies that have low L2 norm, not low description length / Kolmogorov complexity.

I have some math that hints that those may be equivalent-ish statements.
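For the L2-norm half of the quoted explanation, here is a minimal sketch in the simplest setting I can think of: overparametrised linear regression, where the weight-decay (ridge) minimiser approaches the minimum-L2-norm interpolating solution as the penalty shrinks. The data, sizes, and function name `ridge_solution` are made up for illustration, and this says nothing about description length or about nonlinear networks.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))  # 5 data points, 20 parameters (overparametrised)
y = rng.normal(size=5)

def ridge_solution(X, y, weight_decay):
    """Minimiser of ||X w - y||^2 + weight_decay * ||w||^2.

    Uses the dual form X^T (X X^T + lambda I)^{-1} y, which equals the usual
    (X^T X + lambda I)^{-1} X^T y but is better conditioned when there are
    more parameters than data points.
    """
    n_points = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + weight_decay * np.eye(n_points), y)

# Minimum-L2-norm solution that fits the data exactly.
w_min_norm = np.linalg.pinv(X) @ y

# Ridge solution with a tiny weight-decay coefficient.
w_ridge = ridge_solution(X, y, weight_decay=1e-6)

# Another exact fit: add a component from the null space of X.
v = rng.normal(size=20)
null_part = v - np.linalg.pinv(X) @ (X @ v)
w_other = w_min_norm + null_part

print(np.linalg.norm(w_ridge - w_min_norm))
print(np.linalg.norm(w_min_norm), np.linalg.norm(w_other))
```

Both `w_min_norm` and `w_other` fit the data exactly; the weight-decay solution lands next to the one with the smaller norm.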

I don't understand the parameter-space-volume argument, even after a long back-and-forth with Vladimir Nesov here. If it were true, wouldn't we expect to be able to distill models like GPT-3 down to 10-100x fewer parameters?

Why would we expect a 10x distillation ...

1Ivan Vendrov7mo
Would love to see your math! If L2 norm and Kolmogorov provide roughly equivalent selection pressure that's definitely a crux for me.

Another general-purpose search trick which someone will probably bring up if I don’t mention it is caching solutions to common subproblems. I don’t think of this as a heuristic; it mostly doesn’t steer the search process, just speeds it up.

Terminology quibble, but this totally seems like a heuristic to me. "When faced with a problem that seems difficult to solve directly, first find the most closely related problem that seems easy to solve" seems like the overriding general heuristic generator that encompasses both problem relaxation and solution memor...

3Antoine de Scorraille5mo
The difference (here) between "Heuristic" and "Cached-Solutions" seems to me analogous to the difference between lazy evaluation [https://en.wikipedia.org/wiki/Lazy_evaluation] and memoization [https://en.wikipedia.org/wiki/Memoization]:
1. Lazy evaluation ~ Heuristic: aims to guide the evaluation/search by reducing its space.
2. Memoization ~ Cached Solutions: stores in memory the values/solutions already discovered to speed up the calculation.
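The two halves of that analogy can be sketched in a few lines. The functions below (`fib`, `squares`, `first_square_above`) are illustrative examples, not anything from the post: the memoized function stores subproblem results so each is computed once, while the lazy generator only evaluates as much of the sequence as the consumer asks for.

```python
from functools import lru_cache
from itertools import count

calls = {"fib": 0}

@lru_cache(maxsize=None)
def fib(n):
    """Memoization: each subproblem is solved once, then read from the cache."""
    calls["fib"] += 1
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def squares():
    """Lazy evaluation: squares are produced one at a time, on demand."""
    for k in count():
        yield k * k

def first_square_above(limit):
    # Stops pulling from the (infinite) generator at the first match.
    return next(s for s in squares() if s > limit)

print(fib(30), calls["fib"])       # 832040 31
print(first_square_above(50))      # 64
```

Without the cache, `fib(30)` would recompute the same subproblems over a million times; with it, the function body runs once per distinct argument. The cache changes how fast the answer arrives, not which answers are considered.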

I don't think I'm seeing the complexity you're seeing here. For instance, one method we plan on trying is taking sets of heads and MLPs, and reverting them to their og values to see that set's qualitative influence on behavior. I don't think this requires rigorous operationalizations.

That sounds to me like it would give you a very rough, microscope-level view of all the individual things the training is changing around. I am sceptical that by looking at this ground-level data, you'd be able to separate out the things-that-are-agency from everything e...

I also think we can get info without robust operationalizations of concepts involved, but robust operationalizations would certainly allow us to get more info.

I think unless you're extremely lucky and this turns out to be a highly human-visible thing somehow, you'd never notice what you're looking for among all the other complicated changes happening that nobody has analysis tools or even vague definitions for yet.

Which easier methods do you have in mind?

Dunno. I was just stating a general project-picking heuristic I have, and that it's eyeing your proposa...

2Garrett Baker8mo
Good ideas! I worry that a shallow MLP wouldn't be capable enough to see a rich signal in the direction of increasing agency, but we should certainly try to do the easy version first. I don't think I'm seeing the complexity you're seeing here. For instance, one method we plan on trying is taking sets of heads and MLPs, and reverting them to their og values to see that set's qualitative influence on behavior. I don't think this requires rigorous operationalizations. An example: In a chess-playing context, this will lead to different moves, or out-of-action-space behavior. The various kinds of out-of-action-space behavior or biases in move changes seem like they'd give us insight into what the head-set was doing, even if we don't understand the mechanisms used inside the head set.
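The reverting procedure described here could look roughly like the sketch below. It assumes, purely for illustration, that a model's weights are available as a dictionary from component name to parameter values; the component names and the function `revert_components` are hypothetical and not tied to any particular library or to the actual experimental setup.

```python
import copy

def revert_components(trained, original, component_names):
    """Return a copy of `trained` with the named components reset to `original`.

    Neither input dictionary is modified.
    """
    patched = copy.deepcopy(trained)
    for name in component_names:
        patched[name] = copy.deepcopy(original[name])
    return patched

# Toy stand-ins for per-component parameters (hypothetical names and values).
original = {"head_0": [1.0, 2.0], "head_1": [3.0], "mlp_0": [4.0]}
trained = {"head_0": [1.5, 2.5], "head_1": [3.5], "mlp_0": [4.5]}

# Revert one head and one MLP, leaving the other head as trained.
patched = revert_components(trained, original, ["head_0", "mlp_0"])
print(patched)
```

One would then run the patched model on the same inputs as the trained one and compare behavior, for example which chess moves change.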