silentbob - LessWrong

Maybe I accidentally overpromised here :D this code is just an expression, namely 1.0000000001 ** 175000000000, which, as wolframalpha agrees, yields 3.98e7.

silentbob's Shortform

silentbob1mo370

One crucial question in understanding and predicting the learning process, and ultimately the behavior, of modern neural networks, is that of the shape of their loss landscapes. What does this extremely high dimensional landscape look like? Does training generally tend to find minima? Do minima even exist? Is it predictable what type of minima (or regions of lower loss) are found during training? What role does initial randomization play? Are there specific types of basins in the landscape that are qualitatively different from others, that we might care about for safety reasons?

First, let’s just briefly think about very high dimensional spaces. One somewhat obvious observation is that they are absolutely vast. With each added dimension, the volume of the available space increases exponentially. Intuitively we tend to think of 3-dimensional spaces, and often apply this visual/spatial intuition to our understanding of loss landscapes. But this can be extremely misleading. Parameter spaces are utterly incredibly vast to a degree that our brain can hardly fathom. Take GPT3 for instance. It has 175 billion parameters, or dimensions. Let’s assume somewhat arbitrarily that all parameters end up in a range of [-0.5, 0.5], i.e. live in a 175-billion-dimensional unit cube around the origin of that space (as this is not the case, the real parameter space is actually even much, much larger, but bear with me). Even though every single axis only varies by 1 – let’s just for the sake of it interpret this as “1 meter” – even just taking the diagonal from one corner to the opposite one in this high-dimensional cube, you would get a length of ~420km. So if, hypothetically, you were sitting in the middle of this high dimensional unit cube, you could easily touch every single wall with your hand. But nonetheless, all the corners would be more than 200km distant from you.

This may be mind boggling, but is it relevant? I think it is. Take this realization for instance: if you have two minima in this high dimensional space, but one is just a tiny bit “flatter” than the other (meaning the second derivatives overall are a bit closer to 0), then the attractor basin of this flatter minimum is vastly larger than that of the other minimum. This is because the flatness implies a larger radius, and the volume depends exponentially on that radius. So, at 175 billion dimensions, even a microscopically larger radius means an overwhelmingly larger volume. If, for instance, one minimum’s attractor basin has a radius that is just 0.00000001% larger than that of the other minimum, then its volume will be roughly 40 million times larger (if my Javascript code to calculate this is accurate enough, that is). And this is only for GPT3, which is almost 4 years old by now.

The parameter space is just ridiculously large, so it becomes really crucial how the search process through it works and where it lands. It may be that somewhere in this vast space, there are indeed attractor basins that correspond to minima that we find extremely undesirable – certain capable optimizers perhaps, that have situational awareness and deceptive tendencies. If they do exist, what could we possibly tell about them? Maybe these minima have huge attractor basins that are reliably found eventually (maybe once we switch to a different network architecture, or find some adjustment to gradient descent, or reach a certain model size, or whatever), which would of course be bad news. Or maybe these attractor basins are so vanishingly small that we basically don’t have to care about them at all, because all the computer & search capacity of humanity over the next million years would have an almost 0 chance of ever stumbling onto these regions. Maybe they are even so small that they are numerically unstable, and even if your search process through some incredible cosmic coincidence happens to start right in such a basin, the first SGD step would immediately jump out of it due to the limitations of numerical accuracy on the hardware we’re using.

So, what can we actually tell at this point about the nature of high dimensional loss landscapes? While reading up on this topic, one thing that constantly came up is the fact that, the more dimensions you have, the lower the relative number of minima becomes compared to saddle points. Meaning that whenever the training process appears to slow down and it looks like it found some local minimum, it’s actually overwhelmingly likely that what it actually found is a saddle point, hence the training process never halts but keeps moving through parameter space, even if the loss doesn't change that much. Do local minima exist at all? I guess it depends on the function the neural network is learning to approximate. Maybe some loss landscapes exist where the loss can just get asymptotically closer to some minimum (such as 0), without ever reaching it. And probably other loss landscapes exist where you actually have a global minimum, as well as several local ones.

Some people argue that you probably have no minima at all, because with each added dimension it becomes less and less likely that a given point is a minimum (because not only does the first derivative of a point have to be 0 for it to be a minimum, also all the second derivatives need to be in on it, and all be positive). This sounds compelling, but given that the space itself also grows exponentially with each dimension, we also have overwhelmingly more points to choose from. If you e.g. look at n-dimensional Perlin Noise, its absolute number of local minima within an n-dimensional cube of constant side length actually increases with each added dimension. However, the relative number of local minima compared to the available space still decreases, so it becomes harder and harder to find them.

I’ll keep it at that. This is already not much of a "quick" take. Basically, more research is needed, as my literature review on this subject yielded way more questions than answers, and many of the claims people made in their blog posts, articles and sometimes even papers seemed to be more intuitive / common-sensical or generalized from maybe-not-that-easy-to-validly-generalize-from research.

One thing I’m sure about however is that almost any explanation of how (stochastic) gradient descent works, that uses 3D landscapes for intuitive visualizations, is misleading in many ways. Maybe it is the best we have, but imho all such explainers should come with huge asterisks, explaining that the rules in very high dimensional spaces may look much different than our naive “oh look at that nice valley over there, let’s walk down to its minimum!” understanding, that happens to work well in three dimensions.

What are your greatest one-shot life improvements?

silentbob1mo10

So how did it work out for you?

OpenAI: Fallout

silentbob2mo21

That seems like a rather uncharitable take. Even if you're mad at the company, would you (at least (~falsely) assuming this all may indeed be standard practice and not as scandalous as it turned out to be) really be willing to pay millions of dollars for the right to e.g. say more critical things on Twitter, that in most cases extremely few people will even care about? I'm not sure if greed is the best framing here.

(Of course the situation is a bit different for AI safety researchers in particular, but even then, there's not that much actual AI (safety) related intel that even Daniel was able to share that the world really needs to know about; most of the criticism OpenAI is dealing with now is on this meta NDA/equity level)

To an LLM, everything looks like a logic puzzle

silentbob2mo10

I would assume ChatGPT gets much better at answering such questions if you add to the initial prompt (or system prompt) to eg think carefully before answering. Which makes me wonder whether "ChatGPT is (not) intelligent" even is a meaningful statement at all, given how vastly different personalities (and intelligences) it can emulate, based on context/prompting alone. Probably a somewhat more meaningful question would be what the "maximum intelligence" is that ChatGPT can emulate, which can be very different from its standard form.

The Alignment Problem No One Is Talking About

silentbob2mo32

Just to note your last paragraph reminds me of Stuart Russel's approach to AI alignment in Human Compatible. And I agree this sounds like a reasonable starting point.

The Alignment Problem No One Is Talking About

silentbob2mo32

Thanks for the post, I find this unique style really refreshing.

I would add to it that there's even an "alignment problem" on the individual level. A single human in different circumstances and at different times can have quite different, sometimes incompatible values, preferences and priorities. And even at any given moment their values may be internally inconsistent and contradictory. So this problem exists on many different levels. We haven't "solved ethics", humanity disagrees about everything, even individual humans disagree with themselves, and now we're suddenly racing towards a point where we need to give AI a definite idea of what is good & acceptable.

Searching for Searching for Search

silentbob3moΩ010

Aren't LLMs already capable of two very different kinds of search? Firstly, their whole deal is predicting the next token - which is a kind of search. They're evaluation all the tokens at every step, and in the end choose the most probable seeming one. Secondly, across-token search when prompted accordingly. Say "Please come up with 10 options for X, then rate them all according to Y, and select the best option" is something that current LLMs can perform very reliably - whether or not "within token search" exists as well. But then again, one might of course argue that search happening within a single forward pass, and maybe even a type of search that "emerged " via SGD rather than being hard baked into the architecture, would be particularly interesting/important/dangerous. We just shouldn't make the mistake of assuming that this would be the only type of search that's relevant.

I think across-token search via prompting already has the potential to lead to the AGI like problems that we associate with mesa optimizers. Evidently the technology is not quite there yet because PoCs like AutoGPT basically don't quite work, so far. But conditional on AGI being developed in the next few years, it would seem very likely to me that this kind of search would be the one that enables it, rather than some hidden "O(1)" search deeply within the network itself.

Edit: I should of course add a "thanks for the post" and mention that I enjoyed reading it, and it made some very useful points!

Searching for Search

silentbob3mo10

Great post! Two thoughts that came to mind while reading it:

the post mostly discussed search happening directly within the network, e.g. within a single forward pass; but what can also happen e.g. in the case of LLMs is that search happens across token-generation rather than within. E.g. you could give ChatGPT a chess constellation and then ask it to list all the valid moves, and then check which move would lead to which state, and if that state looks better than the last one. This would be search depth 1 of course, but still a form of search. In practice it may be difficult because ChatGPT likes to give messages only of a certain length, so it probably stops prematurely if the search space gets too big, but still, search most definitely takes place in this case.
somewhat of a project proposal, ignoring my previous point and getting back to "search within a single forward pass of the network": let's assume we can "intelligent design" our way to a neural network that actually does implement some kind of small search to solve a problem. So we know the NN is on some pretty optimal solution for the problem it solves. What does (S)GD look like at or very near to this point? Would it stay close to this optimum, or maybe instantly diverge away, e.g. because the optimum's attractor basin is so unimaginably tiny in weight space that it's just numerically highly unstable? If the latter (and if this finding indeed generalizes meaningfully), then one could assume that even though search "exists" in parameter space, it's impractical to ever be reached via SGD due to the unfriendly shape of the search space.

Failures in Kindness

silentbob3mo50

Thanks a lot! Appreciated, I've adjusted the post accordingly.

LESSWRONG
LW

Posts

Wiki Contributions

Comments