I'm honestly stunned by this. If it was indeed trained solely on text, how does it end up with such a good idea of how Euclidean space works? That's either stupidly impressive, or a possible hint that the set of natural abstractions is even smaller and a bigger attractor in algorithm space than I thought. The labyrinth seems explicable, but the graphics?
Could a born blind human do this?
But have you ever, even once in your life, thought anything remotely like "I really like being able to predict the near-future content of my visual field. I should just sit in a dark room to maximize my visual cortex's predictive accuracy."?
Possibly yes. I could easily see this underlying human preferences for regular patterns in art. Predictable enough to get a high score, not so predictable that whatever secondary boredom mechanism keeps baby humans from maximising score by staring straight at the ceiling all day kicks in. I'm even getting suspiciou...
That separation between internal preferences and external behaviour is already implicit in Dutch books. Decision theory is about external behaviour, not internal representations. It talks about what agents do, not how agents work inside. Within decision theory, a preference is about something the system does or does not do in a given situation. When they talk about someone preferring pizza without pineapple, it's about that person paying money to not have pineapple on their pizza in some range of situations, not some definition related to computations about pineapples and pizzas in that person's brain.
I'd guess that the same structural properties that would make a network start out in the scarce channel regime by default would also make unintended channels rare. If the internal structure is such that very little information gets passed on unless you invest optimisation to make it otherwise, that same property should mean free rides are not common.
More centrally, I'm a bit doubtful that this potential correspondence is all that important for understanding information transfer inside neural networks. Extant (A)GIs seem to have very few interface point...
To be frank, I have no idea what this is supposed to mean. If “make non-magical, humanlike systems” were actionable[1], there would not be much of an alignment problem. If this post is supposed to indicate that you think you have an idea for how to do this, but it's a secret, fine. But what is written here, by itself, sounds like a wish to me, not like a research agenda.
Outside of getting pregnant, I suppose.
While funny, I think that tweet is perhaps a bit too plausible, and may be mistaken as having been aimed at statistical learning theorists for real, if a reader isn't familiar with its original context. Maybe flag that somehow?
I don't typically imagine gradient hacking to be about mesa optimisers protecting themselves from erasure. Mesa optimisers are good at things. If you want to score well on a hard loss function involving diverse tricky problems, a mesa optimiser is often a great way of doing that. I do not think they would typically need to protect their optimisation circuitry from gradient descent.
Two prototypical examples of gradient hacking as I imagine it in my head are:
I'm confused about this.
Say our points $x_i$ are the times of day measured by a clock, and $y_i$ are the temperatures measured by a thermometer at those times. We're putting in times in the early morning, where I decree temperature to increase roughly linearly as the sun rises.
You write the overparametrized regression model as $f(x_i, \theta)$. Since our model doesn't get to see the index $i$, only the value of $x_i$ itself, that has to implicitly be something like
Where ...
The other risk that could motivate not making this bet is the risk that the market – for some unspecified reason – never has a chance to correct, because (1) transformative AI ends up unaligned and (2) humanity’s conversion into paperclips occurs overnight. This would prevent the market from ever “waking up”.
You don't even need to expect it to occur overnight. It's enough for the market update to predictably occur so late that having lots of money available at that point is no longer useful. If AGI ends the world next week, there's not that ...
Interested, but depends on the cost. If I'm the only one who wants it, I'd be willing to pay $30 to get the whole series, but probably not more. I don't know how long transcriptions usually take, but I'm guessing it'd certainly be >1h. So there'd need to be additional interest to make it worth it.
Epistemic status: sleep deprived musings
If I understand this right, this is starting to sound very testable.
Feed a neural network inputs consisting of variables $X_1, \dots, X_n$: configurations in a 2D Ising model, cat pictures, or anything else we humans think we know the latent variables for.
Train neural networks to output a set of variables $\Lambda$ over the inputs. The loss function scores $\Lambda$ based on how much it induces conditional independence of the inputs over the training data set.
E.g., take the divergence between...
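If it helps make the proposal concrete, here is a minimal PyTorch sketch of the kind of training loop I have in mind. It uses a crude Gaussian proxy for conditional independence (penalising off-diagonal covariance of the reconstruction residuals given $\Lambda$) rather than a proper divergence, and the toy data, architecture, and names are all my own illustrative assumptions:

```python
import torch
import torch.nn as nn

n_vars, latent_dim = 16, 2
# Encoder proposes the latent variables Lambda; the decoder is only there so that
# "what Lambda explains" is well-defined. Both are illustrative stand-ins.
encoder = nn.Sequential(nn.Linear(n_vars, 64), nn.ReLU(), nn.Linear(64, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_vars))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def conditional_independence_loss(x):
    lam = encoder(x)                                  # candidate latent variables Lambda
    resid = x - decoder(lam)                          # what Lambda fails to explain
    centered = resid - resid.mean(0, keepdim=True)
    cov = centered.T @ centered / x.shape[0]          # residual covariance given Lambda
    off_diag = cov - torch.diag(torch.diag(cov))
    # Gaussian proxy: conditional independence of the inputs given Lambda would
    # make the off-diagonal residual covariance vanish.
    return off_diag.pow(2).sum() + resid.pow(2).mean()

for _ in range(1000):
    x = torch.randn(256, n_vars)                      # stand-in for Ising configs / cat pictures
    loss = conditional_independence_loss(x)
    opt.zero_grad(); loss.backward(); opt.step()
```

The test would then be whether the learned $\Lambda$ matches the latent variables we humans think the data has.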
While our current understanding of physics is predictably-wrong, it has no particular reason to be wrong in a way that is convenient for us.[1]
Meanwhile, more refined versions of some of the methods described here seem perhaps doable in principle, with sufficient technology.
You can make difficult things happen by trying hard at them. You can't violate the laws of physics by trying harder.
Out of the many things that might be wrong about the current picture, impossibility of time travel is also one of the things I'd least expect to get overturned.
This paper offers a fairly intuitive explanation for why flatter minima generalize better: suppose the training and testing data have distinct, but nearby, minima that minimize their respective losses. Then the curvature around the training minimum acts as the second-order term in a Taylor expansion that approximates the expected test loss for models near the training minimum.
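In symbols (my own notation, not necessarily the paper's), the argument is roughly

$$\mathcal{L}_{\text{test}}(\theta^*_{\text{train}}) \;\approx\; \mathcal{L}_{\text{test}}(\theta^*_{\text{test}}) \;+\; \tfrac{1}{2}\,(\theta^*_{\text{train}} - \theta^*_{\text{test}})^\top H_{\text{test}}\,(\theta^*_{\text{train}} - \theta^*_{\text{test}}),$$

so if the two minima are close and the test curvature resembles the training curvature, a flat training minimum keeps the extra test loss small.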
I feel like this explanation is just restating the question. Why are the minima of the test and training data often close to each other? What makes reality be that way?
You can come up with some explanation involving mumble mumble fine-tuning, but I feel like that just leaves us where we started.
Our team has copied Lightcone's approach to communicating over Discord, and I've been very happy with it.
Aren't Standard Parametrisation and other parametrisations with a kernel limit mostly used in cases where you're far away from the depth-to-width ≈ 0 limit, so that expansions like the one derived for the NTK parametrisation aren't very predictive anymore unless you calculate infeasibly many terms of the expensive perturbative series?
As far as I'm aware, when you're training really big models where the limit behaviour matters, you use parametrisations that don't get you too close to a kernel limit in the regime you're dealing with. Am I mistake...
The book's results hold for a specific kind of neural network training parametrisation, the "NTK parametrisation", which has been argued (convincingly, to me) to be rather suboptimal. With different parametrisation schemes, neural networks learn features even in the infinite width limit.
You can show that neural network parametrisations can essentially be classified into those that will learn features in the infinite width limit, and those that will converge to some trivial kernel. One can then derive a "maximal update parametrisation", in which infi...
I remain confused about why this is supposed to be a core difficulty for building AI, or for aligning it.
You've shown that if one proceeds naively, there is no way to make an agent that'd model the world perfectly, because it would need to model itself.
But real agents can't model the world perfectly anyway. They have limited compute and need to rely on clever abstractions that model the environment well in most situations while not costing too much compute. That (presumably) includes abstractions about the agent itself.
It seems to me that that's how humans...
I haven't read that deeply into this yet, but my first reaction is that I don't see what this gains you compared to a perspective in which the functions mapping the inputs of the network to the activations of the layers are regarded as the network's elementary units.
Unless I'm misunderstanding something, when you look at the entire network $f$, where $x$ is the input, each polytope of $f(x)$ with its affine transformation corresponds to one of the linear segments of $f$. Same with looking at, say, the polytopes mapping layer t...
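As a quick numerical illustration of that correspondence (my own toy example, not anything from the post): within a single polytope the activation pattern is fixed, so the network restricted to it is one affine map.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D -> 1-D ReLU network; its input-space polytopes are just intervals.
W1, b1 = rng.normal(size=(8, 1)), rng.normal(size=8)
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)

def net(x):
    h = np.maximum(W1 @ np.array([x]) + b1, 0.0)
    return (W2 @ h + b2).item()

def activation_pattern(x):
    return tuple((W1 @ np.array([x]) + b1 > 0).astype(int))

# Polytope boundaries in 1-D are the kinks where a ReLU switches on or off.
kinks = np.sort(-b1 / W1.ravel())
xs = np.linspace(kinks[3] + 1e-6, kinks[4] - 1e-6, 5)   # strictly inside one polytope

print({activation_pattern(x) for x in xs})              # a single activation pattern
slopes = np.diff([net(x) for x in xs]) / np.diff(xs)
print(np.allclose(slopes, slopes[0]))                   # True: affine on this polytope
```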
Sai, who is a lot more topology-savvy than me, now suspects that there is indeed a connection between this norm approach and the topology of the intermediate set. We'll look into this.
Ah, right, you did mention polar coordinates.
Hm, stretching seems handleable. How about also using the weight matrix, for example? Change into the eigenbasis above, then apply stretching to make all L2 norms size 1 or size 0. Then look at the weights, as stretching-and-rotation invariant quantifiers of connectedness?
Maybe doesn't make much sense when considering non-linear transformations though.
Curious how looking at properties of the functions the layer bases embed through their activation patterns fits into this picture.
For example, take the L2 norms of the activations of all the layer basis elements, averaged over some set of network inputs. The sum and product of those norms will both be coordinate independent.
In fact, we can go one step further and form the matrix of the L2 inner products of all the layer basis elements with each other. The eigendecomposition of this matrix is also coordinate independent, up to dege...
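A quick numerical sanity check of that coordinate-independence claim, at least for orthogonal changes of basis (my own toy example; with degenerate eigenvalues the eigenvectors are only determined up to rotation within each eigenspace):

```python
import numpy as np

rng = np.random.default_rng(0)
n_basis, n_inputs = 5, 1000

# Activations of each layer basis element over a batch of inputs (toy stand-in).
A = rng.normal(size=(n_basis, n_inputs))
M = A @ A.T / n_inputs                                  # matrix of L2 inner products

# Orthogonal change of basis: rotate the basis elements.
Q, _ = np.linalg.qr(rng.normal(size=(n_basis, n_basis)))
M_rot = (Q @ A) @ (Q @ A).T / n_inputs                  # equals Q M Q^T

# Eigenvalues agree; eigenvectors transform covariantly with Q.
print(np.allclose(np.linalg.eigvalsh(M), np.linalg.eigvalsh(M_rot)))   # True
```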
I like the "cut" framing, and I'm happy someone else is having a go at these sorts of questions from a somewhat different angle.
Let's say we want to express the following program:
```python
def program(a, b, c):
    if a:
        return b + c
    else:
        return b - c
```
I'm not sure I understand the problem. Neural networks can implement operations equivalent to an if. They're going to be somewhat complicated, but that's to be expected. An if just isn't an elementary operation in arithmetic. It takes some non-linearities to build up.
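As a concrete illustration (my own toy construction, not from the post): for a binary condition and bounded operands, four ReLUs already implement the branch from the example exactly.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def branch_via_relu(a, b, c, M=1e6):
    """Exact 'b + c if a else b - c' for a in {0, 1} and |b +/- c| < M."""
    gate_true = M * (1.0 - a)    # large when a == 0, suppresses the b + c path
    gate_false = M * a           # large when a == 1, suppresses the b - c path
    return (relu(b + c - gate_true) - relu(-(b + c) - gate_true)
            + relu(b - c - gate_false) - relu(-(b - c) - gate_false))

print(branch_via_relu(1, 3.0, 2.0))   # 5.0  (b + c)
print(branch_via_relu(0, 3.0, 2.0))   # 1.0  (b - c)
```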
...Layer Activation Space is a gene
I'm not sure if the number of near zero eigenvalues is the right thing to look at.
If the training process is walking around the parameter space until it "stumbles on" a basin, what's relevant for which basin is found isn't just the size of the basin floor, it's also how big the basin walls are. Analogy: a very narrow cylindrical hole in a flat floor may be harder to fall into than a very wide, sloped hole, even though the bottom of the latter may be just a single point.
I've typically operated under the assumption that something like "basin volum...
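For reference, under the quadratic approximation the basin region below a loss cutoff $\epsilon$ is an ellipsoid whose volume is determined entirely by the Hessian eigenvalues (my own sketch, standard geometry):

$$V(\epsilon) \;=\; \mathrm{Vol}\!\left\{\theta : \tfrac{1}{2}\,\theta^\top H\,\theta \le \epsilon\right\} \;=\; \frac{\pi^{d/2}}{\Gamma\!\left(\tfrac{d}{2}+1\right)}\,\prod_{i=1}^{d}\sqrt{\frac{2\epsilon}{\lambda_i}},$$

which is why near-zero eigenvalues dominate that estimate; the point above is that what the loss does outside this approximation radius, i.e. the walls, also matters for which basin the training process falls into.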
Is the idea with the cosine similarity to check whether similar prompt topics consistently end up yielding similar vectors in the embedding space across all the layers, and different topics end up in different parts of embedding space?
Because individual transformer layers are assumed to only act on specific sub-spaces of the embedding space, and write their results back into the residual stream, so if you can show that different topics end up in different sub-spaces of the stream, you effectively show that different attention heads and MLPs must be d...
Well for starters, it narrows down the kind of type signature you might need to look for to find something like a "desire" inside an AI, if the training dynamics described here are broad enough to hold for the AI too.
It also helped me become less confused about what the "human values" we want the AI to be aligned with might actually mechanistically look like in our own brains, which seems useful for e.g. schemes where you try to rewire the AI to have a goal given by a pointer to its model of human values. I imagine having a better idea of what you're actually aiming for might also be useful for many other alignment schemes.
Not confused, just optimised to handle data of the kind seen in training, and with limited ability to generalise beyond that, compared to human vision.
I imagine you are basically going down the "features as elementary unit" route proposed in Circuits (although you might not be pre-disposed to assume features are the elementary unit). Finding the set of features used by the network and figuring out how it's using them in its computations does not translate 1-to-1 to "find the basis the network is thinking in", in my mind.
Fair enough, imprecise use of language. For some definitions of "thinking" I'd guess a small vision CNN isn't thinking anything.
Also it seems reasonable to me that ≈all of reality is extremely sparse in features, which presumably favors superposition.
Reality is usually sparse in features, and that‘s why even very small and simple intelligences can operate within it most of the time, so long as they don’t leave their narrow contexts. But the mark of a general intelligence is that it can operate even in highly out-of-distribution situations. Cars are usually driven on roads, so an intelligence could get by using a car even if its concepts of car-ness were all mixed up with its con...
Ah, I see. Thank you for pointing this out. Do superposition features actually seem to work like this in practice in current networks? I was not aware of this.
In any case, for a network like the one you describe I would change my claim from
it'd mean that to the AI, dog heads and car fronts are "the same thing".
to the AI having a concept for something humans don't have a neat short description for. So for example, if your algorithm maps X>0 Y>0 to the first case, I'd call it a feature of "presence of dog heads or car fronts, or presence of car f...
I don't think that's true. Imagine a toy scenario of two features that run through a 1D non-linear bottleneck before being reconstructed. Assuming that with some weight settings you can get superposition, the model is able to reconstruct the features ≈perfectly as long as they don't appear together. That means the model can still differentiate the two features, they are different in the model's ontology.
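Something like the following minimal sketch is what I have in mind (my own construction, assuming the two features are nonnegative and rarely co-occur):

```python
def relu(x):
    return max(x, 0.0)

# Two nonnegative features squeezed through a 1-D bottleneck by encoding them
# in opposite (antipodal) directions, then decoded with ReLUs.
def reconstruct(x1, x2):
    h = x1 - x2                   # 1-D bottleneck activation
    return relu(h), relu(-h)      # decoded estimates of x1 and x2

print(reconstruct(0.75, 0.0))    # (0.75, 0.0)  perfect when only one feature is active
print(reconstruct(0.0, 0.25))    # (0.0, 0.25)
print(reconstruct(0.75, 0.25))   # (0.5, 0.0)   interference when both appear together
```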
I'm not sure I understand this example. If I have a single 1-D feature, a floating point number that goes up with the amount of dog-headedness or car-front...
Sure, but that's not a question I'm primarily interested in. I don't want the most interpretable basis, I want the basis the network itself uses for thinking. My goal is to find the elementary unit of neural networks, to build theorems and eventually a whole predictive theory of neural network computation and selection on top of.
That this may possibly make current networks more human-interpretable even in the short run is just a neat side benefit to me.
I'm sorry, but the fact that the output is scalar isn't explained, and a network with a single neuron in the final layer is not the norm.
Fair enough, should probably add a footnote.
...More importantly, I am trying to explain that I think the math does not stay the same in the case where the network output is a vector (which is the usual situation in deep learning) and the loss is some unspecified function. If the network has vector output, then right after where you say "The Hessian matrix for this network would be...", you don't get a factorization like that; y
Your way of doing it basically approximates the network to first order in the parameter changes/second order in the loss function. That's the same as the method I'm proposing above, really, except you're changing the features to account for the chain rule acting on the layers in front of them. You're effectively transforming the network into an equivalent one that has a single linear layer, with the entries of $\nabla_\theta f$ as the features.
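In symbols, the transformation I mean is just the standard first-order expansion around the current parameters $\theta_0$ (my notation):

$$f(x,\theta) \;\approx\; f(x,\theta_0) \;+\; \nabla_\theta f(x,\theta_0)^\top (\theta - \theta_0),$$

i.e. an equivalent model that is linear in the parameters, with the entries of $\nabla_\theta f(x,\theta_0)$ playing the role of features.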
That's fine to do when you're near a global optimum, the case discussed in the main body of this post, and for tin...
What do you think about the Superposition Hypothesis? If that were true, then at a sufficient sparsity of features in the input there is no basis the network is thinking in, meaning it will be impossible to find a rotation matrix that allows for a bijective mapping between neurons and features.
I'd say that there is a basis the network is thinking in in this hypothetical, it would just so happen not to match the human abstraction set for thinking about the problem in question.
If due to superposition, it proves advantageous to the AI to have a sing...
So the eigenvector doesn't give you the features directly in imagespace, it gives you the network parameters which "measure" the feature?
Nope, you can straightforwardly read off the feature in imagespace, I think. Remember, the eigenvector doesn't just show you which parameters "form" the feature through linear combination, it also shows you exactly what that linear combination is. If your eigenvector is (2, 0, -3), that means the feature in image space looks like taking twice the activations of the node connected to the first parameter, plus -3 times th...
I think we're far off from being able to make any concrete claims about selection dynamics with this, let alone selection dynamics about things as complex and currently ill-operationalised as "goals".
I'd hope to be able to model complicated things like this once Selection Theory is more advanced, but right now this is just attempting to find angles to build up the bare basics.
In your main computation it seems like it's being treated as a scalar.
It's an example computation for a network with scalar outputs, yes. The math should stay the same for multi-dimensional outputs though. You should just get higher dimensional tensors instead of matrices.
Vivek wanted to suppose that the Hessian of the loss with respect to the network outputs were equal to the identity matrix, or a multiple thereof, which is the case for mean squared loss.
In theory, a loss function that explicitly depends on network parameters would behave differently than is assumed in this derivation...
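For reference, the general chain-rule decomposition being discussed (my own notation, assuming the loss depends on the parameters only through the vector-valued network output $f_a$):

$$\frac{\partial^2 L}{\partial\theta_i\,\partial\theta_j} \;=\; \sum_{a,b}\frac{\partial f_a}{\partial\theta_i}\,\frac{\partial^2 L}{\partial f_a\,\partial f_b}\,\frac{\partial f_b}{\partial\theta_j} \;+\; \sum_{a}\frac{\partial L}{\partial f_a}\,\frac{\partial^2 f_a}{\partial\theta_i\,\partial\theta_j}.$$

For mean squared loss at a global optimum, $\partial L/\partial f_a = 0$ and $\partial^2 L/\partial f_a\,\partial f_b \propto \delta_{ab}$, which is what recovers the scalar-style factorisation.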
Interesting idea, and I'm generally very in favour of any efforts to find more understandable and meaningful "elementary units" of neural networks right now. I think this is currently the research question that most bottlenecks any efforts to get a deeper understanding of NN internals and NN selection, and I think those things are currently the biggest bottlenecks to any efforts at generating alignment strategies that might actually work. So we should be experimenting with lots of ideas for different NN "bases" to use and construct our theory of Deep Learn...
I think the idea is that if the rotated basis fundamentally "means" something important, rather than just making what's happening easier to picture for us humans, we'd kind of expect the basis computed for X->Y to mostly match the basis for Y->Z.
At least that's the sort of thing I'd expect to see in such a world.
You take the gradient with respect to any preactivation of the next layer. Shouldn't matter which one. That gets you a length n vector. Since the weights are linear, and we treat biases as an extra node of constant activation, the vector does not depend on which preactivation you chose.
The idea is to move to a basis in which there is no redundancy or cancellation between nodes, in a sense. Every node encodes one unique feature that means one unique thing.
Someone more versed in this line of research, clue me in please: conditional on us having developed the kind of deep understanding of neural networks and their training that is implicit in having "agentometers" and "operator recognition programs", and in being able to point to specific representations of stuff in the AGIs' "world model" at all, why would we expect picking out the part of the model that corresponds to human preferences specifically to be hard, and in need of precise mathematical treatment like this?
An agentometer is presumably a thing that finds st...
I think I might just commit to staying away from LSD and Mind Illuminated style meditation entirely. Judging by the frequency of word of mouth accounts like this, the chance of going a little or a lot insane while exposed to them seems frighteningly high.
I wonder why these long term effects seem relatively sparsely documented. Maybe you have to take the meditation really seriously and practice diligently for this stuff to have a high chance of happening, and people in this community do that often, but the average study population doesn't?
Yeah, I think people who are high in abstract thinking, prone to believing their beliefs, and prone to anxious thought patterns should really stay away from psychedelics and from leaning too hard into their runaway thought trains. Also, try to stay grounded with people and activities that don't send you off into abstract thought space. Spend some time with calm, normal people who look at the world in straightforward ways, not only creative wild thinkers. Spend time doing hobbies outdoors that use your physical body and attention in satisfying ways, keeping you engaged enough to stay out of your head.
There can also be factors in this community that make people both unusually likely to go insane and to also try things like meditation and LSD in an attempt to help themselves. It's a bit hard to say given that the post is so vague on what exactly "insanity" means, but the examples of acausal trade etc. make me suspect that it's related to a specific kind of anxiety which seems to be common in the community.
That same kind of anxiety also made me (temporarily) go very slightly crazy many years ago, when I learned about quantum mechanics (and I had nei...
Even if they were somehow extremely beneficial normally (which is fairly unlikely), any significant risk of going insane seems much too high. I would posit they have such a risk for exactly the same reason: when using them, you are deliberately routing around very fundamental safety features of your mind.
Note: I think what you're doing there is asking what incremental change in the training data uniquely strengthens the influence of one feature in the network without touching the others.
The "pointiest directions" in parameter space correspond to the biggest features in the orthogonalised feature set of the network.
So I’d agree with the prediction that if you calculate what dtheta the dx corresponds to in the second network, you'd indeed often find that it's close to being an eigenvector/most prominent orthogonalised feature of the second networ...
There should be a post with some of it out soon-ish. Short summary:
You can show that at least for overparametrised neural networks, the eigenvalues of the Hessian of the loss function at optima, which determine the basin size within some approximation radius, are basically given by something like the number of independent, orthogonal features the network has, and how "big" these features are.
The fewer independent, mutually orthogonal features the network has, and the smaller they are, the broader the optimum will be. Size and orthogonality are given b...
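The flavour of the computation, sketched for the special case of mean squared error (my own rough version, not the upcoming post): with $L(\theta) = \frac{1}{N}\sum_n \big(f(x_n,\theta) - y_n\big)^2$,

$$H \;=\; \frac{2}{N}\sum_n \Big[\nabla_\theta f(x_n,\theta)\,\nabla_\theta f(x_n,\theta)^\top \;+\; \big(f(x_n,\theta)-y_n\big)\,\nabla^2_\theta f(x_n,\theta)\Big],$$

and at a zero-training-loss optimum the second term vanishes, so the Hessian is just the Gram matrix of the "feature" vectors $\nabla_\theta f(x_n,\theta)$: its rank counts the independent, mutually orthogonal features, and its eigenvalues measure how big they are.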
It seems to be taken for granted here that self-awareness=qualia. If something is self-aware and talking or thinking about how it has qualia, that sure is evidence of it having qualia, but I'm not sure the reverse direction holds. What about internal-state-tracking is necessary for creating the mysterious redness of red exactly, or the hurt-iness of pain?
I can see how pain as defined above the spoiler section doesn't necessarily lead to pain qualia, and in many simple architectures obviously doesn't, but I don't see how processing a summary of pain e...
I believe the standard explanation is that overparametrized ML finds generalizing models because gradient descent with weight decay finds policies that have low L2 norm, not low description length / Kolmogorov complexity.
I have some math that hints that those may be equivalent-ish statements.
I don't understand the parameter-space-volume argument, even after a long back-and-forth with Vladimir Nesov here. If it were true, wouldn't we expect to be able to distill models like GPT-3 down to 10-100x fewer parameters?
Why would we expect a 10x times distillation ...
Another general-purpose search trick which someone will probably bring up if I don’t mention it is caching solutions to common subproblems. I don’t think of this as an heuristic; it mostly doesn’t steer the search process, just speed it up.
Terminology quibble, but this totally seems like a heuristic to me. "When faced with a problem that seems difficult to solve directly, first find the most closely related problem that seems easy to solve" seems like the overriding general heuristic generator that encompasses both problem relaxation and solution memor...
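For concreteness, the kind of solution-caching under discussion is ordinary memoisation; a standard Python example (my own, not from the post):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Caching the solutions to the shared subproblems fib(k) turns an
    # exponential-time search into a linear-time one. It doesn't change which
    # answer is found, it only speeds the process up.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))  # 832040
```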
I don't think I'm seeing the complexity you're seeing here. For instance, one method we plan on trying is taking sets of heads and MLPs, and reverting them to their og values to see that set's qualitative influence on behavior. I don't think this requires rigorous operationalizations.
That sounds to me like it would give you a very rough, microscope-level view of all the individual things the training is changing around. I am sceptical that by looking at this ground-level data, you'd be able to separate out the things-that-are-agency from everything e...
I also think we can get info without robust operationalizations of concepts involved, but robust operationalizations would certainly allow us to get more info.
I think unless you're extremely lucky and this turns out to be a highly human-visible thing somehow, you'd never notice what you're looking for among all the other complicated changes happening that nobody has analysis tools or even vague definitions for yet.
Which easier methods do you have in mind?
Dunno. I was just stating a general project-picking heuristic I have, and that it's eyeing your proposa...
Thanks, I did not know this. A quick search for his images seems to show that they get colour and perspective right at least as well as this does. Provided this is fully real and there's nobody else in his process choosing colours and such. Tentatively marking this down as a win for natural abstraction.