""
The kinds of humans that we are worried about are the kinds of humans that can do original scientific research and autonomously form plans for taking over the world. Human brains learn to take actions and plans that previously led to high rewards (outcomes like eating food when hungry, having sex, etc)*. These two things are fundamentally not the same thing. Why, exactly, would we expect that a system that is good at the latter necessarily would be able to do the former?"
""
This feels like a bit of a digression, but we do have concrete examples of systems...
The general rule I'm following is "if the argument would say false things about humans, then don't update on it".
Yes, this is of course very sensible. However, I don't see why these arguments would apply to humans, unless you make some additional assumption or connection that I am not making. Considering the rest of the conversation, I assume the difference is that you draw a stronger analogy between brains and deep learning systems than I do?
I want to ask a question that goes something like "how correlated is your credence that arguments 5-10 apply to hum...
But for all of them except argument 6, it seems like the same argument would imply that humans would not be generally intelligent.
Why is that?
Because text on the Internet sometimes involves people using logic, reasoning, hypothesis generation, analyzing experimental evidence, etc, and so plausibly the simplest program that successfully predicts that text would do so by replicating that logic, reasoning etc, which you could then chain together to make scientific progress.
What does the argument say in response?
There are a few ways to respond.
First of all, wh...
"human-level question answering is believed to be AI-complete" - I doubt that. I think that [...]
Yes, these are also good points. Human-level question answering is often listed as a candidate for an AI-complete problem, but there are of course people who disagree. I'm inclined to say that question-answering probably is AI-complete, but that belief is not very strongly held. In your example of the painter; you could still convey a low-resolution version of the image as a grid of flat colours (similar to how images are represented in computers), and tell the...
Ah, now I get your point, sorry. Yes, it is true that GPTs are not incentivised to reproduce the full data distribution, but rather, are incentivised to reproduce something more similar to a maximum-likelihood estimate point distribution. This means that they have lower variance (at least in the limit), which may improve performance in some domains, as you point out. But individual samples from the model will still have a high likelihood under the data distribution.
The benchmarks tell you about what the existing systems do. They don't tell you about what's possible.
Of course. It is almost certainly possible to solve the problem of catastrophic forgetting, and the solution might not be that complicated either. My point is that it is a fairly significant problem that has not yet been solved, and that solving it probably requires some insight or idea that does not yet exist. You can achieve some degree of lifelong learning through regularised fine-tuning, but you cannot get anywhere near what would be required for human...
If you look at a random text on the internet it would be very surprising if every word in it is the most likely word to follow based on previous words.
I'm not completely sure what your point is here. Suppose you have a biased coin, that comes up heads with p=0.6 and tails with p=0.4. Suppose you flip it 10 times. Would it be surprising if you then get heads 10 times in a row? Yes, in a sense. But that is still the most likely individual sequence.
...The step from GTP3 to InstructGPT and ChatGPT was not one of scaling up in terms of size of models and sub
What I'm suggesting is that volume in high-dimensions can concentrate on the boundary.
Yes. I imagine this is why overtraining doesn't make a huge difference.
Falsifiable Hypothesis: Compare SGD with overtaining to the random sampling algorithm. You will see that functions that are unlikely to be generated by random sampling will be more likely under SGD with overtraining. Moreover, functions that are more likely with random sampling will be become less likely under SGD with overtraining.
See e.g., page 47 in the main paper.
(Maybe you weren't disagreeing with Zach and were just saying the same thing a different way?)
I'm honestly not sure, I just wasn't really sure what he meant when he said that the Bayesian and the Kolmogorov complexity stuff were "distractions from the main point".
...This feels similar to:
Saying that MLK was a "criminal" is one way of saying that MLK thought and acted as though he had a moral responsibility to break unjust laws and to take direct action.
(This is an exaggeration but I think it is directionally correct. Certainly when I read the title "neural ne
I agree with your summary. I'm mainly just clarifying what my view is of the strength and overall role of the Algorithmic Information Theory arguments, since you said you found them unconvincing.
I do however disagree that those arguments can be applied to "literally any machine learning algorithm", although they certainly do apply to a much larger class of ML algorithms than just neural networks. However, I also don't think this is necessarily a bad thing. The picture that the AIT arguments give makes it reasonably unsurprising that you would get the...
I have a few comments on this:
Fundamentally the point here is that generalization performance is explained much more by the neural network architecture, rather than the structure of stochastic gradient descent, since we can see that stochastic gradient descent tends to behave similarly to (an approximation of) random sampling. The paper talks a bunch about things like SGD being (almost) Bayesian and the neural network prior having low Kolmogorov complexity; I found these to be distractions from the main point.
The main point, as I see it, is essentially tha...
This part of the argument is indeed quite general, but it’s not vacuous. For the argument to apply you need it to be the case that the function space is overparameterised, that the parameter-function map has low complexity relative to the functions in the function space, and that parameter-function map is biased in some direction. This will not be the case for all learning algorithms.
But, I do agree that the Levin bound argument doesn’t really capture the “mechanism” of what’s going on here. I can of course only speak for myself (and not the other people i...
Like Rohin, I'm not impressed with the information theoretic side of this work.
Specifically, I'm wary of the focus on measuring complexity for functions between finite sets, such as binary functions.
Mostly, we care about NN generalization on problems where the input space is continuous, generally R^n. The authors argue that the finite-set results are relevant to these problems, because one can always discretize R^n to get a finite set. I don't think this captures the kinds of function complexity we care about for NNs.
We’re not saying that discr...
Yes, it does of course apply in that sense.
I guess the question then basically is which level of abstraction we think would be the most informative or useful for understanding what's going on here. I mean, we could for example also choose to take into account the fact that any actual computer program runs on a physical computer, which is governed by the laws of electromagnetism (in which case the parameter-space might count as continuous again).
I'm not sure if accounting for the floating-point implementation is informative or not in this case.
There is a proof of it in "An Introduction to Kolmogorov Complexity and Its Applications" by Ming Li & Paul Vitanyi.
Ah, I certainly agree with this.
I do not wish to claim that all functions with low Kolmogorov complexity have large volumes in the parameter-space of a sufficiently large neural network. In fact, I can point to several concrete counterexamples to this claim. To give one example, the identity function certainly has a low Kolmogorov complexity, but it's very difficult for a (fully connected feed-forward) neural network to learn this function (if the input and output is represented in binary form by a bit string). If you try to learn this function by training...
I'm not sure I agree with this -- I think Kolmogorov complexity is a relevant notion of complexity in this context.
Yes, I agree with this.
I should note that SGD definitely does make a contribution to the inductive bias, but this contribution does seem to be quite small compared to the contribution that comes from the implicit prior embedded in the parameter-function map. For example, if you look at Figure 6 in the Towards Data Science post, you can see that different versions of SGD give a slightly different inductive bias. It's also well-known that the inductive bias of neural networks is affected by things like how long you train for, and what step size and bat...
I believe Chris has now updated the Towards Data Science blog post to be more clear, based on the conversation you had, but I'll make some clarifications here as well, for the benefit of others:
The key claim, that , is indeed not (meant to be) dependent on any test data per se. The test data comes into the picture because we need a way to granuralise the space of possible functions if we want to compare these two quantities empirically. If we take "the space of functions" to be all the functions that a given neural network can express...
We do have empirical data which shows that the neural network "prior" is biased towards low-complexity functions, and some arguments for why it would make sense to expect this to be the case -- see this new blog post, and my comment here. Essentially, low-complexity functions correspond to larger volumes in the parameter-space of neural networks. If functions with large volumes also have large basins of attraction, and if using SGD is roughly equivalent to going down a random basin (weighted by its size), then this would essentially explain why neural netw...
We do have a lot of empirical evidence that simple functions indeed have bigger volumes in parameter-space (exponentially bigger volumes, in fact). See especially the 2nd and 3rd paper linked in the OP.
It's true that we don't have a rigorous mathematical proof for "why" simple functions should have larger volumes in parameter-space, but I can give an intuitive argument for why I think it would make sense to expect this to be the case.
First of all, we can look at the claim "backwards"; suppose a function has a large volume in parameter-space. This means tha...
Yes, as Adam Shimi suggested below, I'm here talking about "input-output relations" when I say "functions".
The claim can be made formal in a few different ways, but one way is this: let's say we have a classification task with 10 classes, 50k training examples, and 10k test examples. If we consider the domain to be these 60k instances (and the codomain to be the 10 classes), then we have 10^60k functions defined in this space, 10^10k of which fit the training data perfectly. Of these 10^10k functions, the vast vast majority will have very poor accura...
As you may know, CDT has a lot of fans in academia. It might be interesting to consider what they have to say about Newcomb's Problem (and other supposed counter-examples to CDT).
In "The Foundations of Causal Decision Theory", James Joyce argues that Newcomb's Problem is "unfair" on the grounds that it treats EDT and CDT agents differently. An EDT agent is given two good choices ($1,000,000 and $1,001,000) whereas a CDT agent is given two bad choices ($0 and $1,000). If you wanted to represent Newcomb's Problem as a Marko...
I have already (sort of) addressed this point at the bottom of the post. There is a perspective from which any optimizer_1 can (kind of) be thought of as an optimizer_2, but its unclear how informative this is. It is certainly at least misleading in many cases. Whether or not the distinction is "leaky" in a given case is something that should be carefully examined, not something that should be glossed over.
I also agree with what ofer said.
"even if we can make something safe in optimizer_1 terms, it may still be dangerous as an optimizer_2 b...
The fact that a superintelligent AI contains an optimization algorithm does not necessarily mean that this optimization algorithm is itself superintelligent (or that it has access to the world model of the overall system, etc). It might, it might not – it depends on the design of the system.
"the Cartesian boundary is an issue once we talk about self-improving AI."
This presumably depends on a lot of specific facts about how the system is designed.
I know these things. Nothing you have said contradicts my point, as far as I can see. The point I am making here is one of conceptual clarification, which the intent of enabling more clear thinking and reasoning.
You seem to be talking about a system that outputs "plans that, if implemented, would achieve X" (roughly), and your point seems to be that such a system would be likely to be or behave like an optimizer_2. I find this claim quite plausible (and fully compatible with the point I'm making).
"It would require a selective blindne...
The argument I quoted does not mention evolution. I'm not saying that the argument can't be patched, I'm saying that the argument is inadequate as stated. I should note, however, that evolution selects organisms based on their ability to do optimisation_2, not their ability to do optimisation_1. It's therefore not clear when and how you can simply "generalise from evolution".
A linear program solver is a system that maximises or minimises a linear function subject to non-strict linear constraints.
Many SAT-solvers are implemented as optimizers. For example, they might try to find an assignment that satisfies as many clauses as possible, or they might try to minimise the size of the clauses using resolution.
I don't think that I'm assuming the existence of some sort of Cartesian boundary, and the distinction between these two senses of "optimizer" does not seem to disappear if you think of a computer as an embedded, causal structure. Could you state more precisely why you think that this is a Cartesian distinction?
I should clarify that I'm not necessarily saying that there can't be cases in which a system that is believed or intended to be an optimizer_1 might become or turn out to be an optimizer_2 – I have not really argued for or against this. What I want to do is enable clearer thinking about issue, so that one does not slide between these two concepts without noticing.
Yes, agents with different preferences are incentivised to cooperate provided that the cost of enforcing cooperation is less than the cost of conflict. Agreeing to adopt a shared utility function via acausal trade might potentially be a very cheap way to enforce cooperation, and some agents might do this just based on their prior. However, this is true for any agents with different preferences, not just agents of the type I described. You could use the same argument to say that you are in general unlikely to find two very intelligent agents with different utility functions.
Empirically, the inductive bias that you get when you train with SGD, and similar optimisers, is in fact quite similar to the inductive bias that you would get, if you were to repeatedly re-initialise a neural network until you randomly get a set of weights that yield a low loss. Which optimiser you use does have an effect as well, but this is very small by comparison. See this paper.