If you have any questions for me or just want to talk, feel free to reach out by sending a private message on this site or by sending an e-mail to egeerdil96@gmail.com.
You can also find me on Metaculus at https://www.metaculus.com/accounts/profile/116023/, or on Discord using the tag Starfall#7651.
It's my assumption because our brains achieve AGI on ~20 W.
I think that's probably the crux. I think the evidence that the brain is not performing that much computation is reasonably good, so I attribute the difference to algorithmic advantages the brain has, particularly ones that make the brain more data efficient relative to today's neural networks.
The brain being more data efficient I think is hard to dispute, but of course you can argue that this is simply because the brain is doing a lot more computation internally to process the limited amount of data it does see. I'm more ready to believe that the brain has some software advantage over neural networks than to believe that it has an enormous hardware advantage.
I'm posting this as a separate comment because it's a different line of argument, but I think we should also keep it in mind when making estimates of how much computation the brain could actually be using.
If the brain is operating at a frequency of (say) 10 Hz and is doing 1e20 FLOP/s, that suggests the brain has something like 1e19 floating point parameters, or maybe specifying the "internal state" of the brain takes something like 1e20 bits. If you want to properly train a neural network of this size, you need to update on a comparable amount of useful entropy from the outside world. This means you have to believe that humans are receiving on the order of 1e11 bits or 10 GB of useful information about the world to update on every second if the brain is to be "fully trained" by the age of 30, say.
An estimate of 1e15 FLOP/s brings this down to a more realistic 100 KB or so, which still seems like a lot but is somewhat more believable if you consider the potential information content of visual and auditory stimuli. I think even this is an overestimate and that the brain has some algorithmic insights which make it somewhat more data efficient than contemporary neural networks, but I think the gap implied by 1e20 FLOP/s is rather too large for me to believe it.
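The arithmetic in the two paragraphs above can be checked with a short script. The 10 Hz operating frequency and the evenly spread ~30-year training window are the assumptions from the argument; the ~10 bits per parameter figure is my own illustrative choice to connect 1e19 parameters to 1e20 bits of internal state:

```python
# Back-of-the-envelope check of the data-bandwidth argument.
# Assumptions (illustrative): parameter count = FLOP/s divided by a ~10 Hz
# operating frequency, ~10 bits per parameter of internal state, and the
# useful information arriving evenly over ~30 years of lifetime learning.

def required_input_rate(flop_per_s, freq_hz=10, bits_per_param=10,
                        training_seconds=30 * 365 * 24 * 3600):
    params = flop_per_s / freq_hz
    state_bits = params * bits_per_param
    return state_bits / training_seconds  # useful bits needed per second

for flops in (1e20, 1e15):
    bits_per_s = required_input_rate(flops)
    print(f"{flops:.0e} FLOP/s -> {bits_per_s / 8:.3g} bytes/s of useful input")
```

This reproduces the two figures above: roughly 1.3e10 bytes/s (~10 GB/s) of useful input for the 1e20 FLOP/s estimate, versus roughly 1.3e5 bytes/s (~100 KB/s) for 1e15 FLOP/s.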
2e6 eV are spent per FP16 operation... This is 1e8 times higher than the Landauer limit of 2e-2 eV per bit erasure at 70 C (and the ratio of bit erasures per FP16 operation is unclear to me; let's pretend it's O(1))
2e-2 eV for the Landauer limit is right, but 2e6 eV per FP16 operation is off by one order of magnitude. (70 W)/(2e15 FLOP/s) = 0.218 MeV. So the gap is 7 orders of magnitude assuming one bit erasure per FLOP.
This is wrong, the power consumption is 700 W so the gap is indeed 8 orders of magnitude.
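Putting the corrected numbers together (700 W of power, 2e15 FLOP/s, one bit erasure per FLOP as the pretend O(1) ratio, Landauer evaluated at 70 C):

```python
import math

# Sanity check of the corrected gap: energy per FP16 op on an H100 SXM
# (700 W, ~2e15 FLOP/s) versus the Landauer limit k_B * T * ln(2) at 70 C.
K_B = 1.380649e-23    # Boltzmann constant, J/K
EV = 1.602176634e-19  # J per eV

energy_per_op = 700 / 2e15                    # J per FP16 operation
landauer = K_B * (273.15 + 70) * math.log(2)  # J per bit erasure at 70 C

print(f"per-op energy:  {energy_per_op / EV:.2e} eV")   # ~2e6 eV
print(f"Landauer limit: {landauer / EV:.2e} eV")        # ~2e-2 eV
print(f"gap: {math.log10(energy_per_op / landauer):.1f} OOM")  # ~8 OOM
```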
An H100 SXM has 8e10 transistors, 2e9 Hz boost frequency,
700 W of max power consumption...
8e10 * 2e9 = 1.6e20 transistor switches per second. This happens with a power consumption of 700 W, suggesting that each switch dissipates on the order of 30 eV of energy, which is only 3 OOM or so from the Landauer limit. So this device is actually not that inefficient if you look only at how efficiently it's able to perform switches. My position is that you should not expect the brain to be much more efficient than this, though perhaps gaining one or two orders of magnitude is possible with complex error correction methods.
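A quick check of those per-switch numbers (note this assumes every transistor switches once per clock cycle, which overstates real switching activity, so it is a lower bound on the energy per actual switch):

```python
import math

# Per-switch energy from the figures above: 8e10 transistors toggling at
# a 2e9 Hz boost clock under 700 W of power consumption.
EV = 1.602176634e-19  # J per eV

switches_per_s = 8e10 * 2e9               # 1.6e20 switches/s
energy_per_switch = 700 / switches_per_s  # J per switch

print(f"~{energy_per_switch / EV:.1f} eV per switch")  # ~27 eV
print(f"{math.log10(energy_per_switch / (0.02 * EV)):.1f} OOM above Landauer")
```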
Of course, the number of transistor switches per FLOP and the per-switch energy gap have to add up (in orders of magnitude) to the 8 OOM overall efficiency gap we've calculated. However, it's important that most of the inefficiency comes from the former and not the latter. I'll elaborate on this later in the comment.
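To make the decomposition explicit, using the figures from this thread (1.6e20 switches/s, 2e15 FLOP/s, ~27 eV per switch, ~0.02 eV Landauer limit):

```python
import math

# Decomposing the 8 OOM gap: (switches per FLOP) x (per-switch energy
# relative to Landauer). Both factors are taken from the comment above.
switches_per_flop = 1.6e20 / 2e15  # ~8e4, i.e. ~5 OOM
per_switch_gap = 27.3 / 0.02       # ~1.4e3, i.e. ~3 OOM

total_oom = math.log10(switches_per_flop) + math.log10(per_switch_gap)
print(f"{total_oom:.1f} OOM total")  # ~8 OOM
```

So roughly 5 of the 8 orders of magnitude come from the number of switches per FLOP and only about 3 from per-switch inefficiency.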
This seems pretty inefficient to me!
I agree an H100 SXM is not a very efficient computational device. I never said modern GPUs represent the pinnacle of energy efficiency in computation or anything like that, though similar claims have previously been made by others on the forum.
Positing that brains are ~6 orders of magnitude more energy efficient than today's transistor circuits doesn't seem at all crazy to me. ~6 orders of improvement on 2e6 is ~2 eV per operation, still two orders of magnitude above the 0.02 eV per bit erasure Landauer limit.
Here we're talking about the brain possibly doing 1e20 FLOP/s, which I've previously said is maybe within one order of magnitude of the Landauer limit or so, and not the more extravagant figure of 1e25 FLOP/s. The disagreement here is not about math; we both agree that this performance requires the brain to be 1 or 2 OOM from the bitwise Landauer limit depending on exactly how many bit erasures you think are involved in a single 16-bit FLOP.
The disagreement is more about how close you think the brain can come to this limit. Most of the energy losses in modern GPUs come from the enormous amounts of noise that you need to deal with in interconnects that are closely packed together. To get anywhere close to the bitwise Landauer limit, you need to get rid of all of these losses. This is what would be needed to lower the number of transistor switches per FLOP without simultaneously increasing the power consumption of the device.
I just don't see how the brain could possibly pull that off. The design constraints are pretty similar in both cases, and the brain is not using some unique kind of material or architecture which could eliminate dissipative or radiative energy losses in the system. Just as information needs to get carried around inside a GPU, information also needs to move inside the brain, and moving information around in a noisy environment is costly. So I would expect by default that the brain is many orders of magnitude from the Landauer limit, though I can see estimates as high as 1e17 FLOP/s being plausible if the brain is highly efficient. I just think you'll always be losing many orders of magnitude relative to Landauer as long as your system is not ideal, and the brain is far from an ideal system.
I'll note too that cells synthesize informative sequences from nucleic acids using less than 1 eV of free energy per bit. That clearly doesn't violate Landauer or any laws of physics, because we know it happens.
I don't think you'll lose as much relative to Landauer when you're doing that, because you don't have to move a lot of information around constantly. Transcribing a DNA sequence and other similar operations are local. The reason I think realistic devices will fall far short of Landauer is the problem of interconnect: computations cannot be localized effectively, so different parts of your hardware need to talk to each other, and that's where you lose most of the energy. In terms of pure switching efficiency of transistors, we're already pretty close to this kind of biological process, as I've calculated above.
I don't think transistors have too much to do with neurons beyond the abstract observation that neurons most likely store information by establishing gradients of potential energy. When the stored information needs to be updated, that means some gradients have to get moved around, and if I had to imagine how this works inside a cell it would probably involve some kind of proton pump operating across a membrane or something like that. That's going to be functionally pretty similar to a capacitor, and discharging & recharging it probably carries similar free energy costs.
I think what I don't understand is why you're defaulting to the assumption that the brain has a way to store and update information that's much more efficient than what we're able to do. That doesn't sound like a state of ignorance to me; it seems like you wouldn't hold this belief if you didn't think there was a good reason to do so.
Why does switching barriers imply that electrical potential energy is probably being converted to heat? I don't see how that follows at all.
Where else is the energy going to go? Again, in an adiabatic device where you have a lot of time to discharge capacitors and such, you might be able to do everything in a way that conserves free energy. I just don't see how that's going to work when you're (for example) switching transistors on and off at a high frequency. It seems to me that the only place to get rid of the electrical potential energy that quickly is to convert it into heat or radiation.
I think what I'm saying is standard in how people analyze power costs of switching in transistors, see e.g. this physics.se post. If you have a proposal for how you think the brain could actually be working to be much more energy efficient than this, I would like to see some details of it, because I've certainly not come across anything like that before.
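For reference, the standard estimate referenced above puts the dissipation at roughly C V^2 / 2 per charge or discharge of the switched capacitance. A sketch with illustrative values (the capacitance and supply voltage here are my own assumptions for a modern process, not measured H100 figures):

```python
# Standard CV^2/2 switching-energy estimate. The effective switched
# capacitance and supply voltage below are illustrative assumptions.
EV = 1.602176634e-19  # J per eV
C = 5e-17   # F, assumed effective switched capacitance per transistor
V = 0.75    # V, assumed supply voltage

energy = 0.5 * C * V ** 2  # J dissipated per charge (again per discharge)
print(f"~{energy / EV:.0f} eV per switch")
```

With these assumed values the result lands within an order of magnitude of the ~30 eV per switch inferred earlier from the H100's power draw, which is the consistency check that matters here.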
To what extent do information storage requirements weigh on FLOPS requirements? It's not obvious to me that requirements on energy barriers for long-term storage in thermodynamic equilibrium necessarily bear on transient representations of information in the midst of computations, either because the system is out of thermodynamic equilibrium or because storage times are very short.
The Boltzmann factor roughly gives you the steady-state distribution of the associated two-state Markov chain, so if time delays are short it's possible this would be irrelevant. However, I think that in realistic devices the Markov chain reaches equilibrium far too quickly for you to get around the thermodynamic argument by appealing to the system being out of equilibrium.
My reasoning here is that the Boltzmann factor also gives you the odds of an electron having enough kinetic energy to cross the potential barrier upon colliding with it, so e.g. if you imagine an electron stuck in a potential well that's O(k_B T) deep, the electron will only need to collide with one of the barriers O(1) times to escape. So the rate of convergence to equilibrium comes down to the length of the well divided by the thermal speed of the electron, which is going to be quite rapid as electrons at the Fermi level in a typical wire move at speeds comparable to 1000 km/s.
I can try to calculate exactly what you should expect the convergence time here to be for some configuration you have in mind, but I'm reasonably confident when the energies involved are comparable to the Landauer bit energy this convergence happens quite rapidly for any kind of realistic device.
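As a rough sketch of that convergence-time estimate under stated assumptions (the 10 nm well length is my own illustrative choice; the ~1000 km/s Fermi-level speed is the figure from the comment above):

```python
# Rough escape-time estimate for an electron in an O(k_B T)-deep well:
# well length divided by the electron's speed, since only O(1) barrier
# collisions are needed to escape. The well length is an assumption.
well_length = 10e-9  # m (assumed)
fermi_speed = 1e6    # m/s (~1000 km/s, as in the comment)

escape_time = well_length / fermi_speed
print(f"escape time ~ {escape_time:.0e} s")  # ~1e-14 s, i.e. ~10 fs
```

That is around five orders of magnitude shorter than a ~1 ns clock period at GHz frequencies, which is the point: the system equilibrates well within a single cycle.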
First, I'm confused by your linkage between floating point operations and information erasure. For example, if we have two 8-bit registers (A, B) and multiply to get (A, B*A), we've done an 8-bit floating point operation without 8 bits of erasure. It seems quite plausible to me that the brain does 1e20 FLOPS but with a much smaller rate of bit erasures.
As a minor nitpick, if A and B are 8-bit floating point numbers then the multiplication map x -> B*x is almost never injective. This means even in your idealized setup, the operation (A, B) -> (A, B*A) is going to lose some information, though I agree that this information loss will be << 8 bits, probably more like 1 bit amortized or so.
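The non-injectivity is easy to demonstrate with a toy low-precision format. This sketch simulates a hypothetical tiny float (rounding to a 3-bit mantissa; not a real float8 implementation) and counts how many distinct inputs survive multiplication by a fixed B:

```python
import math

# Toy illustration: in a low-precision float format, x -> B*x is not
# injective, so (A, B) -> (A, B*A) still erases a little information
# even though A is kept. The format here (3 mantissa bits) is a crude
# stand-in for an 8-bit float, chosen only for illustration.

def quantize(x, mantissa_bits=3):
    """Round x to the nearest representable value with a 3-bit mantissa."""
    if x == 0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    scale = 2.0 ** (e - mantissa_bits)
    return round(x / scale) * scale

B = 3.0
inputs = sorted({quantize(1 + i / 16) for i in range(16)})
outputs = {quantize(B * x) for x in inputs}

# Fewer distinct outputs than inputs: some input pairs collide after
# rounding, so a small amount of information is lost per multiply.
print(len(inputs), len(outputs))  # prints: 9 8
```

Here exactly one pair of distinct inputs collides, consistent with the "~1 bit amortized" loss mentioned above rather than anything close to 8 bits.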
The bigger problem is that logical reversibility doesn't imply physical reversibility. I can think of ways in which we could set up sophisticated classical computation devices which are logically reversible, and perhaps could be made approximately physically reversible when operating in a near-adiabatic regime at low frequencies, but the brain is not operating in this regime (especially if it's performing 1e20 FLOP/s). At high frequencies, I just don't see which architecture you have in mind to perform lots of 8-bit floating point multiplications without raising the entropy of the environment by on the order of 8 bits.
Again using your setup, if you actually tried to implement (A, B) -> (A, A*B) on a physical device, you would need to take the register that is storing B and replace the stored value with A*B instead. To store 1 bit of information you need a potential energy barrier that's at least as high as k_B T log(2), so you need to switch ~ 8 such barriers, which means in any kind of realistic device you'll lose ~ 8 k_B T log(2) of electrical potential energy to heat, either through resistance or through radiation. It doesn't have to be like this, and some idealized device could do better, but GPUs are not idealized devices and neither are brains.
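The ~8 k_B T log(2) figure above works out to a small but nonzero dissipation floor; a quick check at body temperature (310 K):

```python
import math

# Dissipation floor implied above: switching ~8 potential barriers of
# height k_B * T * ln(2) each, evaluated at body temperature (310 K).
K_B = 1.380649e-23    # Boltzmann constant, J/K
EV = 1.602176634e-19  # J per eV

floor = 8 * K_B * 310 * math.log(2)
print(f"~{floor / EV:.2f} eV lost per 8-bit multiply")  # ~0.15 eV
```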
Ajeya Cotra estimates training could take anything from 1e24 to 1e54 floating point operations, or even more. Her narrower lifetime anchor ranges from 1e24 to 1e38ish.
Two points about that:
This is a measure that takes into account the uncertainty over how much less efficient our software is compared to the human brain. I agree that human lifetime learning compute being around 1e25 FLOP is not strong evidence that the first TAI system we train will use 1e25 FLOP of compute; I expect it to take significantly more than that.
Moreover, this is an estimate of effective FLOP, meaning that Cotra takes into account the possibility that software efficiency progress can reduce the physical computational cost of training a TAI system in the future. It was also in units of 2020 FLOP, and we're already in 2023, so just on that basis alone, these numbers should get adjusted downwards now.
Do you think Cotra's estimates are not just poor, but crazy as well?
No, because Cotra doesn't claim that the human brain performs 1e25 FLOP/s; her claim is quite different.
The claim that "the first AI system to match the performance of the human brain might require 1e25 FLOP/s to run" is not necessarily crazy, though it needs to be supported by evidence of the relative inefficiency of our algorithms compared to the human brain and by estimates of how much software progress we should expect to be made in the future.
The unpopularity of the war in early 1917 is rather overstated. In fact, even after the fall of the Tsarist government, the war was so popular that before Lenin returned to Russia, Stalin felt it necessary to change the Bolshevik party line by endorsing Russia's continued participation in the war.
I agree that the chaotic conditions in 1917 Russia were essential for a minority to seize power, but similarly chaotic conditions could come to exist in many Western countries as well, perhaps as the result of a world war or economic transformation driven by AI.
I don't have a source for this claim off the top of my head, but I've previously read that Germany was actually a net beneficiary of international financial transactions in the 1920s. Essentially, the flow of funds went like this:
It would be nice if someone could check whether this is true or not, but the impression I got from reading the history here is that the role of war reparations in causing fiscal problems for Germany was inflated by propaganda, especially by German politicians who tried to blackmail the Allies into lowering the amount of reparations to be paid by raising the specter of economic collapse in Germany.
Probably true, and this could mean the brain has some substantial advantage over today's hardware (like 1 OOM, say), but at the same time the internal mechanisms that biology uses to establish electrical potential energy gradients and so forth seem quite inefficient. Quoting Eliezer: