
Linkpost for: https://pbement.com/posts/threads.html

Today's interesting number is 961.

Say you're writing a CUDA program and you need to accomplish some task for every element of a long array. Well, the classical way to do this is to divide up the job amongst several different threads and let each thread do a part of the array. (We'll ignore blocks for simplicity, maybe each block has its own array to work on or something.) The method here is as follows:

for (int i = threadIdx.x; i < array_len; i += 32) {
    arr[i] = ...;
}

So the threads make the following pattern (if there are 8 threads):

⬛🟫🟥🟧🟨🟩🟦🟪⬛🟫🟥🟧🟨🟩🟦🟪⬛🟫🟥🟧🟨🟩🟦🟪⬛🟫

This is for an array of length 26. We can see that the work is split as evenly as possible between the threads, except that threads 0 and 1 (black and brown) have to process the last two elements of the array while the rest of the threads have finished their work and remain idle. This is unavoidable because we can't guarantee that the length of the array is a multiple of the number of threads. But this only happens at the tail end of the array, and for a large number of elements the wasted effort becomes a very small fraction of the total. In any case, each thread will loop ⌈26/8⌉ = 4 times, though it may be idle during the last loop while it waits for the other threads to catch up.
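
As a sanity check, here is a small host-side sketch (mine, not from the original post) that reproduces the assignment above for 8 threads and 26 elements and counts how many of each thread's 4 iterations do real work:

#include <stdio.h>

int main() {
    const int num_threads = 8;
    const int array_len = 26;
    const int loops = (array_len + num_threads - 1) / num_threads;  // ceil(26/8) = 4

    // Which thread touches each element? This reproduces the repeating
    // 0,1,...,7,0,1,... colour pattern of the diagram above.
    for (int i = 0; i < array_len; i++) {
        printf("%d ", i % num_threads);
    }
    printf("\n");

    // Every thread runs the same number of loop iterations; a thread whose
    // last index falls past the end of the array is idle on that iteration.
    for (int t = 0; t < num_threads; t++) {
        int work = 0;
        for (int i = t; i < array_len; i += num_threads) work++;
        printf("thread %d: %d of %d iterations do real work\n", t, work, loops);
    }
    return 0;
}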

One may be able to spend many happy hours programming the GPU this way before running into a question: What if we want each thread to operate on a contiguous area of memory? (In most cases, we don't want this.) In the previous method (which is the canonical one), the parts of the array that each thread worked on were interleaved with each other. Now we run into a scenario where, for some reason, the threads must operate on contiguous chunks. "No problem," you say, "we simply need to break the array into chunks and give a chunk to each thread."

// Round the chunk size up so that chunksz * blockDim.x covers the whole array.
const int chunksz = (array_len + blockDim.x - 1)/blockDim.x;
for (int i = threadIdx.x*chunksz; i < (threadIdx.x + 1)*chunksz; i++) {
    if (i < array_len) {  // guard: the last chunk may run past the end of the array
        arr[i] = ...;
    }
}

If we size the chunks at 3 items, that won't be enough (3 × 8 = 24 < 26), so again we need ⌈26/8⌉ = 4 items per chunk. Here is the result:

⬛⬛⬛⬛🟫🟫🟫🟫🟥🟥🟥🟥🟧🟧🟧🟧🟨🟨🟨🟨🟩🟩🟩🟩🟦🟦

Beautiful. Except you may have noticed something missing. There are no purple squares. Though thread 6 is a little lazy and doing 2 items instead of 4, thread 7 is doing absolutely nothing! Its chunk would start at index 7 × 4 = 28, which is already past the end of the 26-element array: it has somehow managed to fall off the end entirely.

Unavoidably, the threads must collectively be idle for T⌈N/T⌉ − N loops, where N is the array length and T is the number of threads. This is the conserved total amount of idleness. With the first method, the idleness is spread out across threads. Mathematically, the amount of idleness can be no greater than T − 1 regardless of array length and thread number, and so each thread will be idle for at most 1 loop. But in the contiguous method, the idleness is concentrated in the last threads. There is nothing mathematically impossible about the idleness being as big as ⌈N/T⌉ (one thread's entire share of loops) or bigger, and so it's possible for an entire thread to remain unused. Multiple threads, even. Eg. take N = 9 with 8 threads:

⬛⬛🟫🟫🟥🟥🟧🟧🟨

3 full threads are unused there! Practically, this shouldn't actually be a problem, though. The number of serial loops is still the same, and the total number of idle loops is still the same. It's just distributed differently. The reasons to prefer the interleaved method to the contiguous method would be related to memory coalescing or bank conflicts. The issue of unused threads would be unimportant.
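
To make the accounting concrete, here is another small host-side sketch (again mine, not from the original post) that tallies real work, idle loops, and unused threads under the contiguous scheme for the two cases pictured above:

#include <stdio.h>

// For each thread's contiguous chunk, count the slots that fall inside the
// array (real work) versus past the end (idle loops).
void tally(int array_len, int num_threads) {
    int chunksz = (array_len + num_threads - 1) / num_threads;  // ceil(N/T)
    int idle_total = 0, unused = 0;
    for (int t = 0; t < num_threads; t++) {
        int work = 0;
        for (int i = t * chunksz; i < (t + 1) * chunksz; i++) {
            if (i < array_len) work++;
        }
        idle_total += chunksz - work;
        if (work == 0) unused++;
    }
    printf("N=%d, T=%d: chunk=%d, idle loops=%d, unused threads=%d\n",
           array_len, num_threads, chunksz, idle_total, unused);
}

int main() {
    tally(26, 8);  // 1 unused thread: the missing purple squares
    tally(9, 8);   // 3 unused threads
    return 0;
}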

We don't always run into this effect. If N is a multiple of T, all threads are fully utilized. Also, we can guarantee that there are no unused threads for N larger than a certain maximal value. Namely, take N = (T − 1)²: then ⌈N/T⌉ = T − 1, and so the idleness is T(T − 1) − (T − 1)² = T − 1, which is exactly enough for the last thread to sit out entirely. But if N is larger than this, then one can show that all threads must be used at least a little bit.
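
A quick sketch of that last claim (my own reasoning, not spelled out above), writing c = ⌈N/T⌉ for the chunk size: the last thread is unused exactly when its chunk starts at or past the end of the array, i.e. when

(T − 1)·c ≥ N > T·(c − 1).

These two inequalities force c ≤ T − 1, and hence N ≤ (T − 1)·c ≤ (T − 1)². So for any N beyond (T − 1)², the last thread must receive at least one element.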

So, if we're using 32 CUDA threads, then when the array size is 961 = 31², the contiguous processing method will leave thread 31 idle. And 961 is the largest array size for which that is true.
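
And a final brute-force check (again a sketch of mine, not from the original post) confirming that with 32 threads, 961 is the largest array length whose contiguous chunking leaves the last thread with nothing to do:

#include <stdio.h>

// True if the last of num_threads threads gets no in-bounds index under the
// contiguous chunking scheme.
bool last_thread_idle(int array_len, int num_threads) {
    int chunksz = (array_len + num_threads - 1) / num_threads;  // ceil(N/T)
    return (num_threads - 1) * chunksz >= array_len;
}

int main() {
    const int num_threads = 32;
    int largest = 0;
    for (int n = 1; n <= 4096; n++) {  // search well past (T-1)^2 = 961
        if (last_thread_idle(n, num_threads)) largest = n;
    }
    printf("largest N with an idle last thread: %d\n", largest);  // prints 961
    return 0;
}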

So once that research is finished, assuming it is successful, you'd agree that many worlds would end up using fewer bits in that case? That seems like a reasonable position to me, then! (I find the partial-trace kinds of arguments that people make pretty convincing already, but it's reasonable not to.)

MW theories have to specify when and how decoherence occurs. Decoherence isn't simple.

They don't actually. One could equally well say: "Fundamental theories of physics have to specify when and how increases in entropy occur. Thermal randomness isn't simple." This is wrong because once you've described the fundamental laws and they happen to be reversible, and also aren't too simple, increasing entropy from a low entropy initial state is a natural consequence of those laws. Similarly, decoherence is a natural consequence of the laws of quantum mechanics (with a not-too-simple Hamiltonian) applied to a low entropy initial state.

Good post, and I basically agree with this. I do think it's good to mostly focus on the experimental implications when talking about these things. When I say "many worlds", what I primarily mean is that I predict that we should never observe a spontaneous collapse, even if we do crazy things like putting conscious observers into superposition, or putting large chunks of the gravitational field into superposition. So if we ever did observe such a spontaneous collapse, that would falsify many worlds.

Amount of calculation isn't so much the concern here as the number of bits used to implement that calculation. And there's no law that forces the number of bits encoding the computation to be equal between the two theories. Copenhagen can just waste bits on computations that MWI doesn't have to do.

In particular, I mentioned earlier that Copenhagen has to have rules for when measurements occur and what basis they occur in. How does MWI incur a similar cost? What does MWI have to compute that Copenhagen doesn't that uses up the same number of bits of source code?

Like, yes, an expected-value-maximizing agent that has a utility function similar to ours might have to do some computations that involve identifying worlds, but the complexity of the utility function doesn't count against the complexity of any particular theory. And an expected value maximizer is naturally going to try and identify its zone of influence, which is going to look like a particular subset of worlds in MWI. But this happens automatically exactly because the thing is an EV-maximizer, and not because the laws of physics incurred extra complexity in order to single out worlds.

Right, so we both agree that the randomness used to determine the result of a measurement in Copenhagen and the information required to locate yourself in MWI amount to the same number of bits. But the argument for MWI was never that it had an advantage on this front, but rather that Copenhagen used up some extra bits in the machine that generates the output tape in order to implement the wavefunction collapse procedure. (Not to decide the outcome of the collapse, those random bits are already spoken for. Just the source code of the procedure that collapses the wavefunction and such.) Such code has to answer questions like: Under what circumstances does the wavefunction collapse? What determines the basis the measurement is made in? There needs to be code for actually projecting the wavefunction and then re-normalizing it. This extra complexity is what people mean when they say that collapse theories are less parsimonious/have extra assumptions.

Disagree.

If you're talking about the code complexity of "interleaving": If the Turing machine simulates quantum mechanics at all, it already has to "interleave" the representations of states for tiny things like an electron being in a superposition of spin states or whatever. This must be done in order to agree with experimental results. And then at that point not having to put in extra rules to "collapse the wavefunction" makes things simpler.

If you're talking about the complexity of locating yourself in the computation: Inferring which world you're in is equally complex to inferring which way all the Copenhagen coin tosses came up. It's the same number of bits. (In practice, we don't have to identify our location down to a single world, just as we don't care about the outcome of all the Copenhagen coin tosses.)

This notion of faith seems like an interesting idea, but I'm not 100% sure I understand it well enough to actually apply it in an example.

Suppose Descartes were to say: "Y'know, even if there were an evil Daemon fooling every one of my senses for every hour of the day, I can still know what specific illusions the Daemon is choosing to show me. And hey, actually, it sure does seem like there are some clear regularities and patterns in those illusions, so I can sometimes predict what the Daemon will show me next. So in that sense it doesn't matter whether my predictions are about the physical laws of a material world, or just patterns in the thoughts of an evil being. My mental models seem to be useful either way."

Is that what faith is?

If a rationalist hates the idea of heat death enough that they fool themselves into thinking that there must be some way that the increase in entropy can be reversed, is that an example of not seeing the world as it is? How does this flow from a lack of the first thing?

To be clear, I'm definitely pretty sympathetic to TurnTrout's type error objection. (Namely: "If the agent gets a high reward for ingesting superdrug X, but did not ingest it during training, then we shouldn't particularly expect the agent to want to ingest superdrug X during deployment, even if it realizes this would produce high reward.") But just rereading what Zack has written, it seems quite different from what TurnTrout is saying and I still stand by my interpretation of it.

  • eg. Zack writes: "obviously the line itself does not somehow contain a representation of general squared-error-minimization". So in this line-fitting example, the loss function, i.e. "general squared-error-minimization", refers to the general function (ŷ, y) ↦ (ŷ − y)², and not to the curried loss on the particular dataset, (a, b) ↦ Σᵢ (yᵢ − (a·xᵢ + b))².
  • And when he asks why one would even want the neural network to represent the loss function, there's a pretty obvious answer of "well, the loss function contains many examples of outcomes humans rated as good and bad and we figure it's probably better if the model understands the difference between good and bad outcomes for this application." But this answer only applies to the curried loss.

I wasn't trying to sign up to defend everything Eliezer said in that paragraph, especially not the exact phrasing, so I can't reply to the rest of your comment, which is pretty insightful.

It's the same thing for piecewise-linear functions defined by multi-layer parameterized graphical function approximators: the model is the dataset. It's just not meaningful to talk about what a loss function implies, independently of the training data. (Mean squared error of what? Negative log likelihood of what? Finish the sentence!)

This confusion about loss functions...

I don't think this is a confusion, but rather a mere difference in terminology. Eliezer's notion of "loss function" is equivalent to Zack's notion of "loss function" curried with the training data. Thus, when Eliezer writes about the network modelling or not modelling the loss function, this would include modelling the process that generated the training data.
