I have a compute-market startup called vast.ai, I also do research for Orchid (crypto), and I'm working towards a larger plan to save the world. Currently seeking networking, collaborators, and hires - especially top notch cuda/gpu programmers.
My personal blog: https://entersingularity.wordpress.com/
-- Algo efficiency improved 44x, if we use the OpenAI efficiency baseline for AlexNet
It is ridiculous to interpret this as some general algo efficiency improvement - it's a specific improvement in a specific measure (flops) which doesn't even directly translate into equivalent wall-clock time performance, and is/was already encapsulated in sparsity techniques.
There has been extremely little improvement in general algorithm efficiency, compared to hardware improvement.
Electrons are very light, so the kinetic energy required to get them moving should not be significant in any non-contrived situation, I think. The energy of the magnetic field produced by the current would tend to be a much more important effect.
My current understanding is that the electric current energy transmits through electron drift velocity (and I believe that is the standard textbook understanding?, although I admit I have some questions concerning the details). The magnetic field is just a component of the EM waves which propagate changes in electron KE between electrons (the EM waves implement the connections between masses in the equivalent mass-spring system).
I'm not sure how you got "1 kT per foot" but that seems roughly similar to the model up thread I am replying to from spxtr that got 0.05 fJ/bit/mm, or 5e-23 J/bit/nm. I attempted to derive an estimate from the lower level physics, thinking it might be different, but it ended up in the same range - and also off by the same 2 OOM vs real data. But I mention that the skin effect could plausibly increase power by 10x in my lower level model, as I didn't model it nor use measured attenuation values at all. The other OOM probably comes from analog SNR inefficiency.
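As a quick unit sanity check (just converting spxtr's figure, nothing more):

```python
# Convert spxtr's 0.05 fJ/bit/mm into J/bit/nm.
e_fj_per_mm = 0.05                # fJ per bit per mm
e_j_per_mm = e_fj_per_mm * 1e-15  # 1 fJ = 1e-15 J
e_j_per_nm = e_j_per_mm / 1e6     # 1 mm = 1e6 nm
print(e_j_per_nm)                 # ~5e-23 J/bit/nm
```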
The part of this that is somewhat odd at first is the exponential attenuation. That does show up in my low level model, where any electron kinetic energy in the wire is dissipated by about 50% due to thermal collisions every ~4e-13 seconds (that is the important part from mean free path / relaxation time). But that doesn't naturally lead to a linear bit energy distance scale unless that dissipated energy is somehow replaced/driven by the preceding section of waveform.
So if you sent a single large infinitesimal pulse down a wire of length $D$, the energy you get on the other side is $E e^{-D/\xi}$ for some attenuation constant $\xi$ that works out to about 0.1 mm or so, as it's roughly $c\tau$ (wave propagation speed times relaxation time), not meters. I believe if your chart showed attenuation in the 100THz regime it would be losing 50% per ~0.1 mm instead of per meter.
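A minimal numeric sketch of this attenuation model, assuming a wave propagation speed of ~2e8 m/s in the cable (roughly 2/3 c) and the ~4e-13 s relaxation time from above:

```python
import math

v_wave = 2e8    # assumed EM wave propagation speed in the cable, ~2/3 c (m/s)
tau = 4e-13     # relaxation time from the copper mean free path (s)

xi = v_wave * tau          # characteristic attenuation length (m)
print(xi)                  # ~8e-5 m, i.e. roughly 0.1 mm

D = 1e-3                   # wire length of 1 mm
frac = math.exp(-D / xi)   # fraction of pulse energy surviving
print(frac)                # essentially nothing survives at this scale
```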
We know that resistance is linear, not exponential - which I think arises from long steady flow where every $\tau$ seconds half the electron kinetic energy is dissipated, but this total amount is linear with wire section length. The relaxation time then just determines what steady mean electron drift velocity (current flow) results from the dissipated energy.
So when the wave frequency is much less than $1/\tau$ you still lose about half of the wave energy every $\tau$ seconds, but that loss can be spread out over a much larger wavelength section (and indeed at gigahertz frequencies this model roughly predicts the correct 50% attenuation distance scale of ~10m or so).
This is an occasional reminder that I think pushing the frontier of AI capabilities in the current paradigm is highly anti-social
There's plenty of other similarly fun things you can do instead! Like trying to figure out how the heck modern AI systems work as well as they do
These two research tracks actually end up being highly entangled/convergent, they don't disentangle cleanly in the way you/we would like.
Some basic examples:
successful interpretability tools want to be debugging/analysis tools of the type known to be very useful for capability progress. (It's absolutely insane that people often try to build complex DL systems without the kinds of detailed debugging/analysis tools that are useful in many related fields, such as computer graphics pipelines. You can dramatically accelerate progress when you can quickly visualize/understand your model's internal computations on a gears level vs the black box alchemy approach.)
Deep understanding of neuroscience mechanisms could advance safer brain-sim ish approaches, help elucidate practical partial alignment mechanisms (empathic altruism, prosociality, love, etc), but also can obviously accelerate DL capabilities.
Better approximations of universal efficient active learning (empowerment/curiosity) are obviously dangerous capability wise, but also seem important for alignment by modeling/bounding human utility when externalized.
In the DL paradigm you can't easily separate capabilities and alignment, and forcing that separation seems to constrain us to approaches that are too narrow/limiting to be relevant on short timelines.
See my reply here.
Using steady state continuous power attenuation is incorrect for EM waves in a coax transmission line. It's the difference between the small power required to maintain drift velocity against frictive resistance vs the larger energy required to accelerate electrons up to the drift velocity from zero for each bit sent.
I am skeptical that steady state direct current flow attenuation is the entirety of the story (and indeed it seems to underestimate the actual coax cable wire energy of ~1e-21 to 5e-21 J/bit/nm by a few OOM).
For coax cable the transmission is through a transverse (AC) wave that must accelerate a quantity of electrons linearly proportional to the length of the cable. These electrons rather rapidly dissipate this additional drift velocity energy through collisions (resistance), and the entirety of the wave energy is ultimately dissipated.
This seems different than sending continuous DC power through the wire where the electrons have a steady state drift velocity and the only energy required is that to maintain the drift velocity against resistance. For wave propagation the electrons are instead accelerated up from a drift velocity of zero for each bit sent. It's the difference between the energy required to accelerate a car up to cruising speed and the power required to maintain that speed against friction.
If we take the bit energy to be $E_b$, then there is a natural EM wavelength of $\lambda = hc/E_b$, so $\lambda \approx 1.24 \mathrm{\mu m} / E_b[\mathrm{eV}]$, which works out to ~1um for ~1eV. Notice that using a lower frequency / longer wavelength seems to allow one to arbitrarily decrease the bit energy distance scale, but it turns out this just increases the dissipative loss.
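Checking the wavelength number (this is just the standard photon energy relation, nothing specific to the cable):

```python
h = 6.626e-34    # Planck constant (J*s)
c = 3.0e8        # speed of light (m/s)
eV = 1.602e-19   # 1 eV in joules

E_b = 1 * eV                # assumed ~1 eV bit energy
wavelength = h * c / E_b    # natural EM wavelength for that photon energy
print(wavelength)           # ~1.24e-6 m, i.e. ~1 um
```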
So an initial estimate of the characteristic bit energy distance scale here is ~1eV/bit/um or ~1e-22 J/bit/nm. But this is obviously an underestimate as it doesn't yet include the effect of resistance (and skin effect) during wave propagation.
The bit energy of one wavelength is implemented through electron peak drift velocity on order $v \sim \sqrt{2 E_b / (N m_e)}$, where $N$ is the number of carrier electrons in one wavelength wire section. The relaxation time, or mean time between thermal collisions, with a room temp thermal velocity of around ~1e5 m/s and the mean free path of ~40 nm in copper, is ~4e-13 s. Meanwhile the inverse frequency or timespan of one wavelength is around 3e-14 s for an optical frequency 1eV wave, and is ~1e-9 s for a more typical (much higher amplitude) gigahertz frequency wave. So it would seem that resistance is quite significant on these timescales.
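The relaxation time estimate itself is just the mean free path over the thermal velocity; a quick check, along with how many relaxation times fit into a gigahertz wave period:

```python
mean_free_path = 40e-9   # copper electron mean free path (m)
v_thermal = 1e5          # room temp electron thermal velocity (m/s)

tau = mean_free_path / v_thermal
print(tau)               # ~4e-13 s

T_ghz = 1 / 1e9          # period of a 1 GHz wave (s)
print(T_ghz / tau)       # ~2500 relaxation times per wave period
```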
Very roughly the gigahertz 1e-9 s period wave requires about 5 OOM more energy per wavelength due to dissipation, which cancels out the 5 OOM larger distance scale. Each wavelength section loses about half of the invested energy every $\tau$ ~ 4e-13 seconds, so maintaining the bit energy of $E_b$ requires roughly input power of ~$E_b/\tau$ for $T$ seconds, which cancels out the effect of the longer wavelength distance, resulting in a constant bit energy distance scale independent of wavelength/frequency (naturally there are many other complex effects that are wavelength/frequency dependent, but they can't improve the bit energy distance scale of ~$E_b/(c\tau)$).
For a low frequency (long wavelength) wave with $\tau \ll T$, the resulting bit energy distance scale is:
~ 1eV / 10um ~ 1e-23 J/bit/nm
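In numbers (just the unit conversion for this estimate):

```python
eV = 1.602e-19             # 1 eV in joules
scale = (1 * eV) / 10e-6   # 1 eV per 10 um, in J/m
scale_per_nm = scale * 1e-9
print(scale_per_nm)        # ~1.6e-23 J/bit/nm
```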
If you take the bit energy down to the minimal landauer limit of ~0.01 eV this ends up about equivalent to your lower limit, but I don't think that would realistically propagate.
A real wave propagation probably can't perfectly transfer the bit energy over longer distances and has other losses (dielectric loss, skin effect, etc), so vaguely guesstimating around 100x loss would result in ~1e-21 J/bit/nm. The skin effect alone perhaps increases resistance by roughly 10x at gigahertz frequencies. Coax devices also seem constrained to use specific lower gigahertz frequencies and then boost the bitrate through analog encoding: for example 10-bit analog increases bitrate by 10x at the same frequency but requires about 1024x more power, so it is 2 OOM less efficient per bit.
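The analog encoding penalty can be sketched as follows, assuming power must scale with the number of distinguishable amplitude levels, 2^bits:

```python
bits = 10                  # 10-bit analog encoding per symbol
power_factor = 2 ** bits   # ~1024x more power for 2^10 levels (assumed scaling)
bitrate_factor = bits      # 10x more bits at the same symbol frequency

per_bit_penalty = power_factor / bitrate_factor
print(per_bit_penalty)     # ~100x, i.e. about 2 OOM less efficient per bit
```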
Notice that the basic energy distance scale of ~$E_b/(c\tau)$ is derived from the mean free path, via the relaxation time from $\tau = \lambda_{mfp}/v_t$, where $\lambda_{mfp}$ is the mean free path and $v_t$ is the thermal noise velocity (around ~1e5 m/s for room temp electrons).
Coax cable doesn't seem to have any fundamental advantage over waveguide optical, so I didn't consider it at all in brain efficiency. It requires wires of about the same width (several OOM larger than minimal nanoscale RC interconnect) and largish sending/receiving devices, as in optics/photonics.
The heat lost is then [..] 0.05 fJ/bit/mm. Quite low compared to Jacob's ~10 fJ/mm "theoretical lower bound."
In the original article I discuss interconnect wire energy, not a "theoretical lower bound" for any wire energy communication method - and immediately point out reversible communication methods (optical, superconducting) that do not dissipate the wire energy.
Coax cable devices seem to use around 1 to 5 fJ/bit/mm at a few W of power, or a few OOM more than your model predicts here - so I'm curious what you think that discrepancy is, without necessarily disagreeing with the model.
I describe a simple model of wire bit energy for EM wave transmission in coax cable here which seems physically correct but also predicts a bit energy distance range somewhat below observed.
In some sense none of this matters because if you want to send a bit through a wire using minimal energy, and you aren't constrained much by wire thickness or the requirement of a somewhat large encoder/decoder devices, you can just skip the electron middleman and use EM waves directly - ie optical.
I don't have any strong fundamental reason why you couldn't use reversible signaling through a wave propagating down a wire - it is just another form of wave, as you point out.
The landauer bound still applies of course; it just determines the energy involved rather than dissipated. If the signaling mechanism is irreversible, then the best that can be achieved is on order ~1e-21 J/bit/nm (10x the landauer bound for minimal reliability over a long wire, with a distance scale of about 10 nm from the mean free path of metals). Actual coax cable wire energy is right around that level, which suggests to me that it is irreversible for whatever reason.
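Putting numbers on that floor (room temperature, ~10x the landauer bound, and the ~10 nm mean free path distance scale):

```python
import math

kT = 1.38e-23 * 300           # thermal energy at room temperature (J)
landauer = kT * math.log(2)   # minimal bit erasure energy, ~3e-21 J
E_bit = 10 * landauer         # ~10x landauer for reliable signaling
d_nm = 10                     # ~10 nm distance scale from the mean free path

energy_per_nm = E_bit / d_nm
print(energy_per_nm)          # ~3e-21 J/bit/nm, same order as observed coax energies
```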
The challenge is that conventional transistors need V to be much higher than kT/e, where e is the electron charge, because the V is forming an electrostatic barrier that is supposed to block electrons, even when those electrons might be randomly thermally excited sometimes. The relevant technical term here is “subthreshold swing”. There is a natural (temperature-dependent) limit to subthreshold swing in normal transistors, based on thermal excitation over the barrier—the “thermionic limit” of 60mV/decade at room temperature.
The thermionic voltage of ~20mV is just another manifestation of the landauer/boltzmann noise scale. Single/few electron devices need to use large multiples of this voltage for high reliability; many electron devices can use smaller multiples. I use this in the synapse section ("minimal useful Landauer Limit voltage of ~70mV") and had arrived at the concept before becoming aware of the existing term "thermionic limit".
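The 60mV/decade figure follows directly from the thermal voltage at room temperature:

```python
import math

k = 1.381e-23   # Boltzmann constant (J/K)
q = 1.602e-19   # electron charge (C)
T = 300         # room temperature (K)

v_t = k * T / q               # ~26 mV thermal voltage kT/e
swing = v_t * math.log(10)    # thermionic subthreshold swing limit (V/decade)
print(swing * 1e3)            # ~60 mV/decade
```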
I think one reason your capacitor charging/discharging argument didn't stop this number from coming out so small is that information can travel as pulses along the line that don't have to charge and discharge the entire thing at once.
Sure, information can travel that way in theory, but it doesn't work out in practice for dissipative resistive (ie non-superconducting) wires. Actual on-chip interconnect wires are 'RC wires' which do charge/discharge the entire wire to send a bit. They are like a pipe which allows electrons to flow from some source to a destination device, where that receiving device (transistor) is a capacitor which must be charged to a bit energy $E_b$. The Johnson thermal noise on a capacitor is just the same Landauer Boltzmann noise of ~$kT$. The wire geometry aspect ratio (width/length) determines the speed at which the destination capacitor can be charged up to the bit energy.
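A rough illustration of the kT/C (Johnson) noise on a receiving capacitor; the ~1 fF value here is an assumed illustrative capacitance, not a figure from the discussion above:

```python
import math

kT = 1.38e-23 * 300   # thermal energy at room temperature (J)
C = 1e-15             # assumed ~1 fF receiver capacitance (illustrative)

v_noise = math.sqrt(kT / C)   # RMS thermal noise voltage on the capacitor
print(v_noise * 1e3)          # ~2 mV of kT/C noise

E_noise = 0.5 * kT            # thermal noise energy per mode, independent of C
print(E_noise)                # ~2e-21 J: the same landauer/boltzmann scale
```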
The only way for the RC wire to charge the distant receiver capacitor is by charging the entire wire, leading to the familiar RC wire capacitance energy, which is also very close to the landauer tile model energy using mean free path as the tile size (for the reasons I've articulated in various previous comments).
It's like starting with an uncompressed image, and then compressing it further each year using different compressors (which aren't even the best known, as better compressors were already known earlier or from the beginning), and then measuring the data size reduction over time and claiming it as a form of "general software efficiency improvement". It's nothing remotely comparable to moore's law progress (which more generally actually improves a wide variety of software).