rbv
The vanilla Transformer architecture is horrifically computationally inefficient. I really thought it was a terrible idea when I first learnt about it. For every single token, it processes ALL of the weights in the model and ALL of the context. And a token is less than a word, less than a concept. You generally don't need to consider trivia to fill in grammatical words. On top of that, implementations of it were very inefficient. I was shocked when I read the FlashAttention paper: I had assumed that everyone would have implemented attention that way in the first plac...
tl;dr: For a hovering aircraft, upward thrust equals weight, but this isn't what determines engine power.
I'm no expert, but the important distinction is between power and force (thrust). Power is work done (energy transferred) per unit time. If you were gliding slowly at a fixed altitude in a large, light, unpowered glider (pretending drag is negligible) or, to be actually realistic, hovering in a blimp, with lift equalling weight, you'd be doing no work! (And neither is gravity.) On the other hand, when a helicopter hovers at a fixed altitude it's do...
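To put rough numbers on the thrust-versus-power distinction, here is a minimal sketch using the standard actuator-disc (momentum theory) idealization, which the comment above does not itself name. The mass, the two rotor disc areas, and the helper `ideal_hover_power` are illustrative assumptions; real rotors need noticeably more power than this ideal figure.

```python
import math

G = 9.81          # gravitational acceleration, m/s^2
RHO_AIR = 1.225   # sea-level air density, kg/m^3 (assumed)

def ideal_hover_power(mass_kg, disc_area_m2, rho=RHO_AIR):
    """Ideal induced power (W) to hover, from actuator-disc momentum theory.

    Thrust T equals weight. The rotor accelerates air through the disc at
    induced velocity v = sqrt(T / (2 * rho * A)), and the ideal power is T * v.
    """
    thrust = mass_kg * G
    induced_velocity = math.sqrt(thrust / (2 * rho * disc_area_m2))
    return thrust * induced_velocity

mass = 1000.0  # kg, hypothetical aircraft
small_rotor = ideal_hover_power(mass, disc_area_m2=10.0)
large_rotor = ideal_hover_power(mass, disc_area_m2=100.0)

# Same weight, same thrust, very different power.
print(f"10 m^2 disc:  {small_rotor / 1000:.0f} kW")
print(f"100 m^2 disc: {large_rotor / 1000:.0f} kW")
```

Both cases produce exactly the same thrust, but the smaller disc needs about sqrt(10) ≈ 3.2 times the power, since it has to throw less air downward faster. In the limit of an enormous disc (or a blimp), the power needed to hover goes to zero.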
Fight the tyrant, not the Russian army. I believe the sort of thing that the OP is asking for, if we restrict ourselves to just Russia for the moment, is: is there any way to assist with getting rid of Putin, reducing the harm he causes, or preventing the next Putin after he's gone? Focusing in further on the first of those: Is it helpful to donate to democracy-enhancing initiatives in Russia? (Is it possible to help get Putin voted out? The answer is apparently no.) Can one help to get him overthrown? It seems possible, if he were to become unpopular enou...
Generate an image randomly with each pixel black with 51% chance and white with 49% chance, independently. The most likely image? Totally black. But virtually all the probability mass is on images which are ~49% white. Adding correlations between neighbouring pixels (or, in 1D, correlations between time series events) doesn't remove this problem, despite what you might assume.
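A quick numerical check of the independent-pixel claim above, as a sketch. The image size (`n_pixels = 10_000`) and the 47%-51% window are my own illustrative choices, not from the comment, and everything is done in log space because the raw probabilities underflow.

```python
import math

def log_binom_pmf(n, k, p):
    """Log-probability that exactly k of n independent pixels are white."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

n_pixels = 10_000  # assumed: e.g. a 100x100 image
p_white = 0.49

# The mode: the single all-black image.
log_p_mode = n_pixels * math.log(1 - p_white)

# One specific image with exactly 49% white pixels.
k = round(p_white * n_pixels)
log_p_one_typical = k * math.log(p_white) + (n_pixels - k) * math.log(1 - p_white)

# Total mass on all images with between 47% and 51% white pixels.
lo, hi = round(0.47 * n_pixels), round(0.51 * n_pixels)
mass_typical = math.exp(
    logsumexp([log_binom_pmf(n_pixels, j, p_white) for j in range(lo, hi + 1)])
)

print(f"log P(all black)            = {log_p_mode:.1f}")
print(f"log P(one ~49%-white image) = {log_p_one_typical:.1f}")
print(f"P(47%..51% white)           = {mass_typical:.5f}")
```

The all-black image is far more probable than any individual ~49%-white image, yet its own probability is astronomically small, while the set of ~49%-white images holds essentially all of the mass simply because there are so many of them.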
The core problem is that the mode of a high-dimensional probability distribution is typically degenerate. (Aside, it also causes problems for parameter estimation of unnormalized ener...