All of Joar Skalse's Comments + Replies

Empirically, the inductive bias that you get when you train with SGD, and similar optimisers, is in fact quite similar to the inductive bias that you would get, if you were to repeatedly re-initialise a neural network until you randomly get a set of weights that yield a low loss. Which optimiser you use does have an effect as well, but this is very small by comparison. See this paper.

""
The kinds of humans that we are worried about are the kinds of humans that can do original scientific research and autonomously form plans for taking over the world. Human brains learn to take actions and plans that previously led to high rewards (outcomes like eating food when hungry, having sex, etc)*. These two things are fundamentally not the same thing. Why, exactly, would we expect that a system that is good at the latter necessarily would be able to do the former?"
""

This feels like a bit of a digression, but we do have concrete examples of systems... (read more)

The general rule I'm following is "if the argument would say false things about humans, then don't update on it".

Yes, this is of course very sensible. However, I don't see why these arguments would apply to humans, unless you make some additional assumption or connection that I am not making. Considering the rest of the conversation, I assume the difference is that you draw a stronger analogy between brains and deep learning systems than I do?

I want to ask a question that goes something like "how correlated is your credence that arguments 5-10 apply to hum... (read more)

2Rohin Shah5mo
Okay, I'll take a stab at this. "The kinds of humans that we are worried about are the kinds of humans that can do original scientific research and autonomously form plans for taking over the world. Human brains learn to take actions and plans that previously led to high rewards (outcomes like eating food when hungry, having sex, etc)*. These two things are fundamentally not the same thing. Why, exactly, would we expect that a system that is good at the latter necessarily would be able to do the former?" *I expect that this isn't a fully accurate description of human brains, but I expect that if we did write the full description the argument would sound the same. "This, in turn, suggests a data structure that is discrete and combinatorial, with syntax trees, etc, and humans do (according to the argument) not use such representations. We should therefore expect humans to at some point hit a wall or limit to what they are able to do." (I find it hard to make the argument here because there is no argument -- it's just flatly asserted that neural networks don't use such representations, so all I can do is flatly assert that humans don't use such representations. If I had to guess, you would say something like "matrix multiplications don't seem like they can be discrete and combinatorial", to which I would say "the strength of brain neuron synapse firings doesn't seem like it can be discrete and combinatorial".) We know, from computer science, that it is very powerful to be able to reason in terms of variables and operations on variables. It seems hard to see how you could have human-level intelligence without this ability. However, humans do not typically have this ability, with most human brains instead being more analogous to Boolean circuits, given their finite size and architecture of neuron connections. In this one I'd give the protein folding example, but apparently you think you'd be able to fold proteins just as well as you'd be able to play chess if they

But for all of them except argument 6, it seems like the same argument would imply that humans would not be generally intelligent.


Why is that?

Because text on the Internet sometimes involves people using logic, reasoning, hypothesis generation, analyzing experimental evidence, etc, and so plausibly the simplest program that successfully predicts that text would do so by replicating that logic, reasoning etc, which you could then chain together to make scientific progress.

What does the argument say in response?

There are a few ways to respond.

First of all, wh... (read more)

3Rohin Shah5mo
Meta: A lot of this seems to have the following form: I think you are misunderstanding what I am trying to do here. I'm not trying to claim that humans and neural networks will have the same properties or be identical. I'm trying to evaluate how much I should update on the particular argument you have provided. The general rule I'm following is "if the argument would say false things about humans, then don't update on it". It may in fact be the case that humans and neural networks differ on that property, but if so it will be for some other reason. There is a general catchall category of "maybe something I didn't think about makes humans and neural networks different on this property", and indeed I even assign it decently high probability, but that doesn't affect how much I should update on this particular argument. Responding to particular pieces: The rest of the comment was justifying that. I'm not seeing why that's evidence for the perspective. Even when word order is scrambled, if you see "= 32 44 +" and you have to predict the remaining number, you should predict some combination of 76, 12, and -12 to get optimal performance; to do that you need to be able to add and subtract, so the model presumably still develops addition and subtraction circuits. Similarly for text that involves logic and reasoning, even after scrambling word order it would still be helpful to use logic and reasoning to predict which words are likely to be present. The overall argument for why the resulting system would have strong, general capabilities seems to still go through. In addition, I don't know why you expect that intelligence can't be implemented through "a truly massive ensemble of simple heuristics". Huh, really? I think that's pretty plausible, for all the same reasons that I think it's plausible in the neural network case. (Though not as likely, since I haven't seen the scaling laws for random forests extend as far as in the neural network case, and the analogy to human

"human-level question answering is believed to be AI-complete" - I doubt that. I think that [...]

Yes, these are also good points. Human-level question answering is often listed as a candidate for an AI-complete problem, but there are of course people who disagree. I'm inclined to say that question-answering probably is AI-complete, but that belief is not very strongly held. In your example of the painter; you could still convey a low-resolution version of the image as a grid of flat colours (similar to how images are represented in computers), and tell the... (read more)

1Ben Amitay5mo
Thanks for the detailed response. I think we agree about most of the things that matter, but about the rest: About the loss function for next word prediction - my point was that I'm not sure whether the current GPT is already superhuman even in the prediction that we care about. It may be wrong less, but in ways that we count as more important. I agree that changing to a better loss will not make it significantly harder to learn it any more the same as intelligence etc. About solving discrete representations with architectural change - I think that I meant only that the representation is easy and not the training, but anyway I agree that training it may be hard or at least require non-standard methods. About the inductive logic and describing pictures in low-resolution: I made the same communication mistake in both, which is to consider things that are ridiculously highly regulated as not part of the hypothesis space at all. There probably is a logical formula that describe the probability of a given image to be a cat, to every degree of precision. I claim that will will never be able to find or represent that formula, because it is so regulated against. And that this is the price that the theory forced us to pay for the generalisation.

Ah, now I get your point, sorry. Yes, it is true that GPTs are not incentivised to reproduce the full data distribution, but rather, are incentivised to reproduce something more similar to a maximum-likelihood estimate point distribution. This means that they have lower variance (at least in the limit), which may improve performance in some domains, as you point out. But individual samples from the model will still have a high likelihood under the data distribution. 

2ChristianKl5mo
That's not true for maximum-likelihood distribution is general. It's been more than a decade since I dealt with that topic in university while studying bioinformatics but in the domain of bioinformatics maximum-likelihood distribution can frequently produce results that are impossible to appear in reality and there are a bunch of tricks to avoid that.  To get back to the actual case of large language models, imagine there's a complex chain of verbal reasoning. The next correct word in that reasoning chain has a higher likelihood than 200 different words that could be used that lead to a wrong conclusion. The likelihood of the correct word might be 0.01. A large language model might pick the right word for the reasoning chain for every word over a 1000-word reasoning chain. The result is one that would be very unlikely to appear in the real world. 

The benchmarks tell you about what the existing systems do. They don't tell you about what's possible.

Of course. It is almost certainly possible to solve the problem of catastrophic forgetting, and the solution might not be that complicated either. My point is that it is a fairly significant problem that has not yet been solved, and that solving it probably requires some insight or idea that does not yet exist. You can achieve some degree of lifelong learning through regularised fine-tuning, but you cannot get anywhere near what would be required for human... (read more)

If you look at a random text on the internet it would be very surprising if every word in it is the most likely word to follow based on previous words. 

I'm not completely sure what your point is here. Suppose you have a biased coin, that comes up heads with p=0.6 and tails with p=0.4. Suppose you flip it 10 times. Would it be surprising if you then get heads 10 times in a row? Yes, in a sense. But that is still the most likely individual sequence.

The step from GTP3 to InstructGPT and ChatGPT was not one of scaling up in terms of size of models and sub

... (read more)
2ChristianKl5mo
The benchmarks tell you about what the existing systems do. They don't tell you about what's possible. One of OpenAI's current projects is to figure out how to extract from the conversations that ChatGPT has valuable data for fine-tuning.  There's no fundamental reason why it can't extract from the conversation it has all the relevant information and do fine-tuning to add it to its long-term memory.  When it comes to ToS violations it seems evident that such a system is working, based on my interactions with it. ChatGPT has basically three ways to answer with normal text, with red text, and with custom answers which explain to you why it won't answer your query. Both the red text answers and the custom answers increased over a variety of different prompts. When it does its red text answers there's a feedback button to tell them if you think it made a mistake.  To me, it seems obvious that those red-text answers get used as training material for fine-tuning and that this helps with detecting similar cases in the future.  You could summarize InstructGPT's lesson as "You can get huge capability gains by comparably minor things added on top". You can talk about how they are minor at a technical level but that doesn't change the fact that these minor things produce huge capability gains.  In the future, there's also a lot of additional room to get more clever about providing training data. 
2ChristianKl5mo
That's a different case.  If you have a text you can calculate for every word in the text the likelihood (L_text) how likely it would follow the preceding words in the text. You can also calculate the likelihood (L_ideal) of the most likely word that would follow the preceding text. L_ideal - L_text is in Kahemann's words noise. If you look at a given text you can calculate the average of the noise for each word.  The average noise that's produced by GPT3 is less than that of the average text on the internet. It would be surprising to encounter texts with so little noise randomly on the internet. 

What I'm suggesting is that volume in high-dimensions can concentrate on the boundary.

Yes. I imagine this is why overtraining doesn't make a huge difference.

Falsifiable Hypothesis: Compare SGD with overtaining to the random sampling algorithm. You will see that functions that are unlikely to be generated by random sampling will be more likely under SGD with overtraining. Moreover, functions that are more likely with random sampling will be become less likely under SGD with overtraining.

See e.g., page 47 in the main paper.

1Zachary Robertson2y
You seem to have updated your opinion: overtraining does make difference, but it’s not ‘huge’. Have you run a significance test for your lines of best fit? The plots as presented suggest the effect is significant. Figure C.1.a indicates the tilting phenomena. Probabilities only go up to one so tilting down means that the most likely candidates from overstrained SGD are less likely with random sampling. Thus, unlikely random sampling candidates are more likely under SGD. At the tail, the opposite happens. Functions more likely with random sampling become less likely under SGD. While the optimizer has a larger effect, I think the subtle question is whether the overtraining tilts in the same way each time. Figure 16 indicates yes again. This phenomena you consider to be minor is what I found most interesting about the paper.

(Maybe you weren't disagreeing with Zach and were just saying the same thing a different way?)

I'm honestly not sure, I just wasn't really sure what he meant when he said that the Bayesian and the Kolmogorov complexity stuff were "distractions from the main point".

This feels similar to:

Saying that MLK was a "criminal" is one way of saying that MLK thought and acted as though he had a moral responsibility to break unjust laws and to take direct action.

(This is an exaggeration but I think it is directionally correct. Certainly when I read the title "neural ne

... (read more)

I agree with your summary. I'm mainly just clarifying what my view is of the strength and overall role of the Algorithmic Information Theory arguments, since you said you found them unconvincing. 

I do however disagree that those arguments can be applied to "literally any machine learning algorithm", although they certainly do apply to a much larger class of ML algorithms than just neural networks. However, I also don't think this is necessarily a bad thing. The picture that the AIT arguments give makes it reasonably unsurprising that you would get the... (read more)

I have a few comments on this:

Fundamentally the point here is that generalization performance is explained much more by the neural network architecture, rather than the structure of stochastic gradient descent, since we can see that stochastic gradient descent tends to behave similarly to (an approximation of) random sampling. The paper talks a bunch about things like SGD being (almost) Bayesian and the neural network prior having low Kolmogorov complexity; I found these to be distractions from the main point.

The main point, as I see it, is essentially tha... (read more)

1Zachary Robertson2y
What I'm suggesting is that volume in high-dimensions can concentrate on the boundary. To be clear, when I say SGD only typically reaches the boundary, I'm talking about early stopping and the main experimental setup in your paper where training is stopped upon reaching zero train error. This does seem to invalidate the model. However, something tells me that the difference here is more about degree. Since you use the word 'should' I'll use the wiggle room to propose an argument for what 'should' happen. If SGD is run with early stopping, as described above, then my argument is that this is roughly equivalent to random sampling via an appeal to concentration of measure in high-dimensions. If SGD is not run with early stopping, it's enclosed by the boundary of zero train error functions. Because these are most likely in the interior these functions are unlikely to be produced by random sampling. Thus, on a log-log plot I'd expect overtraining to 'tilt' the correspondence between SGD and random sampling likelihoods downward. Falsifiable Hypothesis: Compare SGD with overtaining to the random sampling algorithm. You will see that functions that are unlikely to be generated by random sampling will be more likely under SGD with overtraining. Moreover, functions that are more likely with random sampling will be become less likely under SGD with overtraining.
3Rohin Shah2y
This seems right, but I'm not sure how that's different from Zach's phrasing of the main point? Zach's phrasing was "SGD approximately equals random sampling", and random sampling finds functions with probability exactly proportional to their volume. Combine that with the fact that empirically we get good generalization and we get the thing you said. (Maybe you weren't disagreeing with Zach and were just saying the same thing a different way?) This feels similar [https://www.lesswrong.com/posts/yCWPkLi8wJvewPbEp/the-noncentral-fallacy-the-worst-argument-in-the-world] to: Saying that MLK was a "criminal" is one way of saying that MLK thought and acted as though he had a moral responsibility to break unjust laws and to take direct action. (This is an exaggeration but I think it is directionally correct. Certainly when I read the title "neural networks are fundamentally Bayesian" I was thinking of something very different.) I've discussed this above [https://www.lesswrong.com/posts/YSFJosoHYFyXjoYWa/why-neural-networks-generalise-and-why-they-are-kind-of?commentId=KFuGWt3FJ288T4tbB], I'll continue the discussion there. The rest of the comment is about stuff that I didn't have a strong opinion on, so I'll leave it for Zach to answer if he wants.

This part of the argument is indeed quite general, but it’s not vacuous. For the argument to apply you need it to be the case that the function space is overparameterised, that the parameter-function map has low complexity relative to the functions in the function space, and that parameter-function map is biased in some direction. This will not be the case for all learning algorithms.

But, I do agree that the Levin bound argument doesn’t really capture the “mechanism” of what’s going on here. I can of course only speak for myself (and not the other people i... (read more)

2Rohin Shah2y
I agree it's not vacuous. It sounds like you're mostly stating the same argument I gave but in different words. Can you tell me what's wrong or missing from my summary of the argument? (Even if you are talking about the overparameterized case, where this argument is not vacuous and also applies primarily to neural nets and not other ML models, I don't find this argument very compelling a priori, though I agree that based on empirical evidence it is probably mostly correct.)

Like Rohin, I'm not impressed with the information theoretic side of this work.

Specifically, I'm wary of the focus on measuring complexity for functions between finite sets, such as binary functions.

Mostly, we care about NN generalization on problems where the input space is continuous, generally R^n.  The authors argue that the finite-set results are relevant to these problems, because one can always discretize R^n to get a finite set.  I don't think this captures the kinds of function complexity we care about for NNs.

We’re not saying that discr... (read more)

9nostalgebraist2y
  I'll write mostly about this statement, as I think it's the crux of our disagreement. The statement may be true as long as we hold the meaning of "objects" constant as we vary the complexity measure. However, if we translate objects from one mathematical space to another (say by discretizing, or adding/removing a metric structure), we can't simply say the complexity measures for space A on the original A-objects inevitably agree with those space B on the translated B-objects.  Whether this is true depends on our choice of translation. (This is clear in the trivial cases of bad translation where we, say, map every A-object onto the same B-object.  Now, obviously, no one would consider this a correct or adequate way to associate A-objects with B-objects.  But the example shows that the claim about complexity measures will only hold if our translation is "good enough" in some sense.  If we don't have any idea what "good enough" means, something is missing from the story.) In the problem at hand, the worrying part of the translation from real to boolean inputs is the loss of metric structure.  (More precisely, the hand-waviness about what metric structure survives the translation, if any.)  If there's no metric, this destroys the information needed by complexity measures that care about how easy it is to reconstruct an object "close to" the specified one. Basic information theory doesn't require a metric, only a measure.  There's no sense of "getting an output approximately right," only of "getting the exactly right output with high probability."  If you care about being approximately right according to some metric, this leads you to rate-distortion theory [https://en.wikipedia.org/wiki/Rate%E2%80%93distortion_theory]. Both of these domains -- information theory without a metric, and with one -- define notions of incompressibility/complexity, but they're different.  Consider two distributions on R: 1. The standard normal, 2. The standard normal, but you chop

Yes, it does of course apply in that sense. 

I guess the question then basically is which level of abstraction we think would be the most informative or useful for understanding what's going on here. I mean, we could for example also choose to take into account the fact that any actual computer program runs on a physical computer, which is governed by the laws of electromagnetism (in which case the parameter-space might count as continuous again). 

I'm not sure if accounting for the floating-point implementation is informative or not in this case.

2evhub2y
My understanding is that floating-point granularity is enough of a real problem that it does sometimes matter in realistic ML settings, which suggests that it's probably a reasonable level of abstraction on which to analyze neural networks (whereas any additional insights from an electromagnetism-based analysis probably never matter, suggesting that's not a reasonable/useful level of abstraction).

There is a proof of it in "An Introduction to Kolmogorov Complexity and Its Applications" by Ming Li & Paul Vitanyi.

Ah, I certainly agree with this.

I do not wish to claim that all functions with low Kolmogorov complexity have large volumes in the parameter-space of a sufficiently large neural network. In fact, I can point to several concrete counterexamples to this claim. To give one example, the identity function certainly has a low Kolmogorov complexity, but it's very difficult for a (fully connected feed-forward) neural network to learn this function (if the input and output is represented in binary form by a bit string). If you try to learn this function by training... (read more)

4johnswentworth2y
That's a clever example, I like it. Based on that description, it should be straightforward to generalize the Levin bound to neural networks. The main step would be to replace the Huffman code with a turbocode (or any other near-Shannon-bound code), at which point the compressibility is basically identical to the log probability density, and we can take the limit to continuous function space without any trouble. The main change is that entropy would become relative entropy (as is normal when taking info theory bounds to a continuous limit). Intuitively, it's just using the usual translation between probability theory and minimum description length, and applying it to the probability density of parameter space.

I'm not sure I agree with this -- I think Kolmogorov complexity is a relevant notion of complexity in this context.

4johnswentworth2y
How so? In particular, are you saying that gaussian processes bias towards low Kolmogorov complexity, or are you saying that there's some simplicitly prior here besides what gaussian processes have? If the former, how does a gaussian process bias towards short programs other than the compressibility offered by smoothness? (Of course smoothness itself implies some compressibility in the Kolmogorov picture; the interesting question is whether there's a bias towards functions which can be compressed in more general ways.)

Yes, I agree with this. 

I should note that SGD definitely does make a contribution to the inductive bias, but this contribution does seem to be quite small compared to the contribution that comes from the implicit prior embedded in the parameter-function map. For example, if you look at Figure 6 in the Towards Data Science post, you can see that different versions of SGD give a slightly different inductive bias. It's also well-known that the inductive bias of neural networks is affected by things like how long you train for, and what step size and bat... (read more)

3Daniel Kokotajlo2y
I'd be interested to hear your take on why this means NN's are only doing interpolation. What does it mean, to only do interpolation and not extrapolation? I know the toy model definitions of those terms (connecting dots vs. drawing a line off away from your dots) but what does it mean in real life problems? It seems like a fuzzy/graded distinction to me, at best. Also, if the simplicity prior they use is akin to kolmogorov complexity-based priors, then that means what they are doing is akin to what e.g. Solomonoff Induction does. And I've never heard anyone try to argue that Solomonoff Induction "merely interpolates" before!

I believe Chris has now updated the Towards Data Science blog post to be more clear, based on the conversation you had, but I'll make some clarifications here as well, for the benefit of others:

The key claim, that , is indeed not (meant to be) dependent on any test data per se. The test data comes into the picture because we need a way to granuralise the space of possible functions if we want to compare these two quantities empirically. If we take "the space of functions" to be all the functions that a given neural network can express... (read more)

We do have empirical data which shows that the neural network "prior" is biased towards low-complexity functions, and some arguments for why it would make sense to expect this to be the case -- see this new blog post, and my comment here. Essentially, low-complexity functions correspond to larger volumes in the parameter-space of neural networks. If functions with large volumes also have large basins of attraction, and if using SGD is roughly equivalent to going down a random basin (weighted by its size), then this would essentially explain why neural netw... (read more)

We do have a lot of empirical evidence that simple functions indeed have bigger volumes in parameter-space (exponentially bigger volumes, in fact). See especially the 2nd and 3rd paper linked in the OP.

It's true that we don't have a rigorous mathematical proof for "why" simple functions should have larger volumes in parameter-space, but I can give an intuitive argument for why I think it would make sense to expect this to be the case.

First of all, we can look at the claim "backwards"; suppose a function has a large volume in parameter-space. This means tha... (read more)

2adamShimi2y
Do you have a good reference for the Levin bound? My attempts at finding a relevant paper all failed.
4evhub2y
In what sense is the parameter space of a neural network not finite and discrete? It is often useful to understand floating-point values as continuous, but in fact they are discrete such that it seems like a theorem which assumes discreteness would still apply.
3Daniel Kokotajlo2y
Thanks, this is great! OK, I'm pretty convinced now, but I was biased to believe something like this anyway. I would love to hear what a skeptic thinks.

Yes, as Adam Shimi suggested below, I'm here talking about "input-output relations" when I say "functions". 

The claim can be made formal in a few different ways, but one way is this: let's say we have a classification task with 10 classes, 50k training examples, and 10k test examples. If we consider the domain to be these 60k instances (and the codomain to be the 10 classes), then we have 10^60k functions defined in this space, 10^10k of which fit the training data perfectly. Of these 10^10k functions, the vast vast majority will have very poor accura... (read more)

3Daniel Kokotajlo2y
Thanks! Yeah, I just forgot the definition of a function, sorry. I was thinking of computer programs. Simple functions are expressed by more computer programs than complex functions. (Which is why I was already inclined to believe your conclusion before reading your stuff -- if neural nets are like computer programs, then it is plausible that simple functions are expressed by more neural net initializations)

As you may know, CDT has a lot of fans in academia. It might be interesting to consider what they have to say about Newcomb's Problem (and other supposed counter-examples to CDT).

In "The Foundations of Causal Decision Theory", James Joyce argues that Newcomb's Problem is "unfair" on the grounds that it treats EDT and CDT agents differently. An EDT agent is given two good choices ($1,000,000 and $1,001,000) whereas a CDT agent is given two bad choices ($0 and $1,000). If you wanted to represent Newcomb's Problem as a Marko... (read more)

I have already (sort of) addressed this point at the bottom of the post. There is a perspective from which any optimizer_1 can (kind of) be thought of as an optimizer_2, but its unclear how informative this is. It is certainly at least misleading in many cases. Whether or not the distinction is "leaky" in a given case is something that should be carefully examined, not something that should be glossed over.

I also agree with what ofer said.

"even if we can make something safe in optimizer_1 terms, it may still be dangerous as an optimizer_2 b... (read more)

2Gordon Seidoh Worley4y
I think there is only a question of how leaky, but it is always non-zero amounts of leaky, which is the reason Bostrom and others are concerned about it for all optimizers and don't bother to make this distinction.

The fact that a superintelligent AI contains an optimization algorithm does not necessarily mean that this optimization algorithm is itself superintelligent (or that it has access to the world model of the overall system, etc). It might, it might not – it depends on the design of the system.

"the Cartesian boundary is an issue once we talk about self-improving AI."
This presumably depends on a lot of specific facts about how the system is designed.

2Bunthut4y
It doesnt need to. The "inner" programm could also use its hardware as quasi-sense organs and figure out a world model of its own. Of course this does depend on the design of the system. In the example described, you could, rather then optimize for speed itself, have a fixed function that estimates speed (like what we do in complexity theory) and then optimize for *that*, and that would get rid of the leak in question. The point I think Bostrom is making is that contrary to intuition, just building the epistemic part of an AI and not telling it to enact the solution it found doesnt guarantee you dont get an optimizer_2.

I agree with everything you have said.

I know these things. Nothing you have said contradicts my point, as far as I can see. The point I am making here is one of conceptual clarification, which the intent of enabling more clear thinking and reasoning.

You seem to be talking about a system that outputs "plans that, if implemented, would achieve X" (roughly), and your point seems to be that such a system would be likely to be or behave like an optimizer_2. I find this claim quite plausible (and fully compatible with the point I'm making).

"It would require a selective blindne... (read more)

1Steven Byrnes4y
RE "make the superintelligence assume that it is disembodied"—I've been thinking about this a lot recently (see The Self-Unaware AI Oracle [https://www.lesswrong.com/posts/SaLc9Dv5ZqD73L3nE/the-self-unaware-ai-oracle/]) and agree with Viliam that knowledge-of-one's-embodiment should be the default assumption. My reasoning is: A good world-modeling AI should be able to recognize patterns and build conceptual transformations between any two things it knows about, and also should be able to do reasoning over extended periods of time. OK, so let's say it's trying to figure something out something about biology, and it visualizes the shape of a tree. Now it (by default) has the introspective information "A tree has just appeared in my imagination!". Likewise, if it goes through any kind of reasoning process, and can self-reflect on that reasoning process, then it can learn (via the same pattern-recognizing algorithm it uses for the external world) how that reasoning process works, like "I seem to have some kind of associative memory, I seem to have a capacity for building hierarchical generative models, etc." Then it can recognize that these are the same ingredients present in those AGIs it read about in the newspaper. It also knows a higher-level pattern "When two things are built the same way, maybe they're of the same type." So now it has a hypothesis that it's an AGI running on a computer. It may be possible to prevent this cascade of events, by somehow making sure that "I am imagining a tree" and similar things never get written into the world model. I have this vision of two data-types, "introspective information" and "world-model information", and your static type-checker ensures that the two never co-mingle. And voila, AI Safety! That would be awesome. I hope somebody figures out how to do that, because I sure haven't. (Admittedly, I have neither time nor relevant background knowledge to try properly.) I'm also slightly concerned that, even if you figure out a w

The argument I quoted does not mention evolution. I'm not saying that the argument can't be patched, I'm saying that the argument is inadequate as stated. I should note, however, that evolution selects organisms based on their ability to do optimisation_2, not their ability to do optimisation_1. It's therefore not clear when and how you can simply "generalise from evolution".

1Pattern4y
I meant the reason people think the jump is possible is because of evolution. (Evolution was just running, business as usual, and suddenly! We're on the moon!) I have seen better versions of this argument that mention humans or evolutionary algorithms explicitly. And you're right - the argument isn't clear on that. The jump is 'a search for better plans will find better plans'.

A linear program solver is a system that maximises or minimises a linear function subject to non-strict linear constraints.

Many SAT-solvers are implemented as optimizers. For example, they might try to find an assignment that satisfies as many clauses as possible, or they might try to minimise the size of the clauses using resolution.

3Logan Riggs4y
Thanks for the explanation and links! I agree that linear program solvers are intrinsically optimizing. SAT-solvers not intrinsically so, but could be used to optimize (such as Pattern’s example in his comment). Moving on, I think it’s hard to define “environment” in a way that isn’t arbitrary, but it’s easier to think about “everything this optimizer affects”. For every optimizer, I could (theoretically) list everything that it affects at that moment, but that list changes based off what’s around the optimizer to be affected. I could rig a set up such that, as soon as I run the optimizer code, it would cause a power surge and blow a fuse. Then, every optimizer has changed the environment while optimizing. But you may argue that an opt2 would purposely choose an action that would affect the environment (especially if choosing that action maximizes reward) while an opt1 wouldn’t even have that action available. But while available actions may be constrained by the type of optimizer, and you could try to make a distinction between different available actions, the affects of those limited actions changes with the programming language, the hardware, etc. I’m still confused on this, and you still may have made a good distinction between different optimizers.

I don't think that I'm assuming the existence of some sort of Cartesian boundary, and the distinction between these two senses of "optimizer" does not seem to disappear if you think of a computer as an embedded, causal structure. Could you state more precisely why you think that this is a Cartesian distinction?

3Bunthut4y
Well, one thing a powerful optimizer might do at some point is ask itself "what programm should I run that will figure out such and such for me". This is what Bostrom is describing in the quote, an optimizer optimizing its own search process. Now, if the AI then searches through the space of possible programms, predicts which one will give it the answer quickest, and then implements it, heres a thing that might happen: There might be a programm that, when ran, affects the outside world in such a way as to speed up the process of answering. For example, it might lead electricity to run through the computer in such a way as to cause it to emit electromagnetic waves, through which it sends a message to a nearby w-lan router and the uses the internet to hack a bank account to buy extra hardware and have it delivered to and pluged into itself, and the it runs a programm calculating the answer on this much more powerful hardware, and in this way ends up having the answer faster then if it had just started calculating away on the weaker hardware. And if the optimizer works as described above, it will implement that programm, and thereby optimize its enviroment. Notably, it will optimize for solving the original optimisation problem faster/better, not try to implement the solution to it it has found. I dont think this makes your distinction useless, as there are genuine system_1 optimizers, even relatively powerful ones, but the Cartesian boundary is an issue once we talk about self-improving AI.
8Gordon Seidoh Worley4y
Sure, let's be super specific about it. Let's say we have something you consider an optimizer_1, a SAT solver. It operates over a set of variables V arranged in predicts P using an algorithm A. Since this is a real SAT solver that is computed rather than a purely mathematical one we think about, it runs on some computer C and thus for each of V, P, and A there is some C(V), C(P), and C(A) that is the manifestation of each on the computer. We can conceptualize what C does to V, P, and A in different ways: it turns them into bytes, it turns A into instructions, it uses C(A) to operate on C(V) and C(P) to produce a solution for V and P. Now the intention is that the algorithm A is an optimizer_1 that only operates on V and P, but in fact A is never run, properly speaking, C(A) is, and we can only say A is run to the extent C(A) does something to reality that we can set up an isomorphism to A with. So C(A) is only an optimizer_1 to the extent the isomorphism holds and it is, as you defined optimizer_1, "solving a computational optimization problem". But properly speaking C(A) doesn't "know" it's an algorithm: it's just matter arranged in a way that is isomorphic, via some transformation, to A. So what is C(A) doing then to produce a solution? Well, I'd say it "optimizes its environment", that is literally the matter and its configuration that it is in contact with, so it's an optimizer_2. You might object that there's something special going on here such that C(A) is still an optimizer_1 because it was set up in a way that isolates it from the broader environment so it stays within the isomorphism, but that's not a matter of classification, that's an engineering problem of making an optimizer_2 behave as if it were an optimizer_1. And a large chunk of AI safety (mostly boxing) is dealing with ways in which, even if we can make something safe in optimizer_1 terms, it may still be dangerous as an optimizer_2 because of unexpected behavior where it "breaks" the isomorp

I should clarify that I'm not necessarily saying that there can't be cases in which a system that is believed or intended to be an optimizer_1 might become or turn out to be an optimizer_2 – I have not really argued for or against this. What I want to do is enable clearer thinking about issue, so that one does not slide between these two concepts without noticing.

Yes, agents with different preferences are incentivised to cooperate provided that the cost of enforcing cooperation is less than the cost of conflict. Agreeing to adopt a shared utility function via acausal trade might potentially be a very cheap way to enforce cooperation, and some agents might do this just based on their prior. However, this is true for any agents with different preferences, not just agents of the type I described. You could use the same argument to say that you are in general unlikely to find two very intelligent agents with different utility functions.

3cousin_it5y
Agents with identical source code will reason identically before seeing any observations, so the "acausal trade" in this case barely feels like trade at all, just making your preferences updateless over possible future observations. That's much simpler than acausal trade between agents with different source code, which we can't even formalize yet.