wyrd

ohmygodthatlojbanbabyissocute! But anyway, I don't think you need to be raised speaking a new language for a good one to have a large effect on your ability to think.
I find it weird that people call it the "Sapir-Whorf hypothesis", as if there's an alternative way people can robustly learn to think better. Engineering a language isn't really about the language; it's about trying to rewrite the way we think. LessWrong and various academic disciplines have had decent success with this on the margin, I'd say, and the phrase "on the margin" is itself a good example of a recent innovation that's marginally helped us think better.
There seems to be a trend that breakthrough innovations often...
Is this a straight sum, where negative actvals cancel with positive ones? If so, would summing the absolute values of the activations instead be more indicative of "distributed effort"? Or, if only positive actvals have an effect on downstream activations, maybe a better metric for "effort" would be to sum only the positive ones?
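A minimal sketch of the three aggregations I have in mind, assuming a hypothetical `acts` tensor holding one token's activation values (the tensor name, shape, and random contents are made up purely for illustration):

```python
import torch

# Hypothetical activations for a single token position, shape (n_layers, d_model).
acts = torch.randn(12, 768)

# Straight (signed) sum: negative actvals cancel with positive ones.
signed_total = acts.sum()

# Sum of absolute values: every unit that moves, in either direction,
# counts toward the "distributed effort" of processing this token.
abs_total = acts.abs().sum()

# Positive-only sum: count just the activations that (by assumption)
# push downstream activations up.
pos_total = acts.clamp(min=0).sum()

print(signed_total.item(), abs_total.item(), pos_total.item())
```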
I'm not sure whether total actvals for a token are a good measure of the "effort" it takes to process it. Maybe. In brains, the salience (somewhat analogous to actvals) of some input is definitely related to how much effort it takes to process it (as measured by the number of downstream neurons affected by it[1]), but I don't know enuf about transformers yet to judge if and how the analogy carries over.
For sufficiently salient input, there's a threshold at which it enters "consciousness": it gets processed in a loop for a while and affects a much larger portion of the network than inputs that don't reach the threshold.
Another way transformers are different: every tensor operation involves the same number of cells & bits, so the amount of computation spent per token processed is constant; unless I'm mistaken?
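A rough sketch of what I mean, using the standard FLOP estimates for one decoder layer (the dimensions are illustrative; the point is that nothing in the count depends on which token is being processed, only on model size and context length):

```python
def flops_per_token_per_layer(d_model, seq_len, d_ff=None):
    """Rough forward-pass FLOPs spent on one token in one transformer layer.

    Counts 2 FLOPs per multiply-accumulate. Nothing here depends on the
    token's content or its activation values, only on the architecture and
    on how much context is attended to.
    """
    if d_ff is None:
        d_ff = 4 * d_model
    qkv_and_out_proj = 2 * 4 * d_model * d_model   # Q, K, V and output projections
    attention_mixing = 2 * 2 * seq_len * d_model   # scores + weighted sum over context
    mlp = 2 * 2 * d_model * d_ff                   # up- and down-projection
    return qkv_and_out_proj + attention_mixing + mlp

# A GPT-2-small-sized layer gives the same count no matter which token comes in.
print(flops_per_token_per_layer(d_model=768, seq_len=1024))
```

The one place where "constant per token" bends a little is the attention_mixing term, which grows with how many prior tokens are in the context, but it still doesn't depend on what the token is.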