Yeah, I was afraid that might apply here. It seems like you should still be able to do something like a 'government employee tier' subscription - not targeted at an individual, but at a class like 'GS-8 and up' - set low enough that it would appeal to such customers? It is not a gift but a discount, it is not to an individual but to a class, it is part of a market, and it is not conditional on any government action or inaction. Such discounts are very common for 'students', 'veterans', 'first responders' etc, and I've never seen any fine print warning government employees about it being >$20, despite many such discounts potentially crossing that threshold (eg. Sam's Club offers $50 off a new membership, which seems clearly >$20, and does so through a whole company devoted to this sort of discount, ID.me).
But I suppose that might be too complex for SA to be interested in bothering with?
Yes. (And they can learn to predict and estimate the reward too, achieving even higher reward than simply being optimized against the reward signal. For example, if you included an input which said which arm had the reward, the RNN would learn to use that, and so would be able to change its decision without experiencing a single negative reward. A REINFORCE or evolution-strategies meta-trained RNN would have no problem learning such a policy, which attempts to learn or infer the reward each episode in order to choose the right action.)
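To make that concrete, here is a minimal sketch of the behavior such a meta-trained RNN converges to once it exploits a cue input revealing the rewarded arm; the names (`cue`, `policy`, `run_episode`) are hypothetical ones of mine, not from any actual setup:

```python
# Illustrative sketch only: the policy a reward-cue-conditioned, meta-trained
# RNN would converge to, written out as plain code.

def policy(cue: int) -> int:
    # The learned behavior amounts to "read the cue, pick the indicated arm".
    return cue

def run_episode(rewarded_arm: int) -> int:
    cue = rewarded_arm            # environment exposes which arm currently pays
    action = policy(cue)
    return 1 if action == rewarded_arm else 0

# The same frozen policy adapts instantly when the rewarded arm swaps,
# with zero negative rewards experienced and zero weight updates:
assert run_episode(0) == 1
assert run_episode(1) == 1
```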
Nor is it at all guaranteed that 'the dog will wag the tail' - depending on circumstances, the tail may successfully wag the dog indefinitely. Maybe the outer level will be able to override the inner, maybe not. Because after all, the outer level may no longer exist, or may be too slow to be relevant, or may be changed (especially by the inner level). To continue the human example, we were created by evolution on genes, but within a lifetime, evolution has no effect on the policy and so even if evolution 'wants' to modify a human brain to do something other than what that brain does, it cannot operate within-lifetime (except at even lower levels of analysis, like in cancers or cell lineages etc); or, if the human brain is a digital emulation of a brain snapshot, it is no longer affected by evolution at all; and even if it does start to mold human brains, it is such a slow high-variance optimizer that it might take hundreds of thousands or millions of years... and there probably won't even be biological humans by that point, never mind the rapid progress over the next 1-3 generations in 'seizing the means of reproduction' if you will. (As pointed out in the context of Von Neumann probes or gray goo, if you add in error-correction, it is entirely possible to make replication so reliable that the universe will burn out before any meaningful level of evolution can happen, per the Price equation. The light speed delay to colonization also implies that 'cancers' will struggle to spread much if they take more than a handful of generations.)
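(For reference, the Price equation invoked here is just the standard decomposition of the per-generation change in a mean trait value $\bar{z}$, nothing specific to this argument:)

$$
\Delta\bar{z} \;=\; \frac{\operatorname{Cov}(w_i, z_i)}{\bar{w}} \;+\; \frac{\operatorname{E}[\,w_i\,\Delta z_i\,]}{\bar{w}}
$$

With error-correction making replication near-perfect, the transmission term vanishes ($\Delta z_i \approx 0$) and the copies are essentially identical, so the trait variance, and hence its covariance with fitness $w_i$, is also $\approx 0$: both terms are driven toward zero and $\Delta\bar{z} \approx 0$ on any relevant timescale.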
Today, the cultures are closer, but the subcultures can be larger. A hundred years ago, there would have been no such thing as the rationalist community.
That seems like a stretch, whether you put the stress on the 'community' or the 'rationalist' part. Subcultures can be larger, of course, if only because the global population is like 5x larger, but niche subcultures like 'the rationalist community' could certainly have existed then. Nothing much has changed there.
A hundred years ago was 1925; in 1925 there were countless communes, cults, Chinatowns/ghettos (or perhaps a better example would be 'Germantowns'), 'scenes', and other kinds of subcultures and notable small groups. Bay Area LW/rationalists have been analogized to, for example, the (much smaller) Bloomsbury Group, which was still active in 1925, and from whom, incidentally, we can directly trace some intellectual influence through economics, decision theory, libertarianism, and analytic philosophy, even if one rejects any connection with poly etc. We've been analogized to the Vienna Circle as well (and to whom we trace back much more), which was in full swing in 1925. Or how about the Fabians before that? Or Technocracy after that? (And in an amusing coincidence, Paul Kurtz turns out to have been born in 1925.) Or things like Esperanto - even now, a century past its heyday, the number of native Esperanto speakers is shockingly comparable to the number of active LW2 users... Then there are fascinating subcultures like the amateur press that nurtured H. P. Lovecraft, who, as of 1925, had grown out of it and was about to start writing the speculative fiction stories that would make him famous.
(And as far as the Amish go, it's worth recalling that they came to the distant large island of America to achieve distance from persecution in Europe - where the Amish no longer exist - and to minimize attrition & interference by 'the English', and that they continue to live in communities as isolated as possible while still consistent with their needs for farmland etc.)
They really rule out much more than that: −0.14 is from their worst-case:
Looking at the estimates, they are very small and often not statistically-significantly different from zero. Sometimes the estimates are negative and sometimes positive, but they are always close to zero. If we take the largest negative point estimates (−0.0047, col. 1) and the largest standard error for that specification (0.0045), the 95% confidence interval would be −0.014 to 0.004. We may thus rule out negative effects larger than 0.14 standard deviations in cognitive ability if fluoride is increased by 1 milligram/liter (the level often considered when artificially fluoridating the water).
So that is not the realistic estimate: it is the worst case, after double-cherrypicking both the point estimate and the standard error to reverse-p-hack a harm. The two most controlled estimates are actually both positive.
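For concreteness, the interval in that quote is just the ordinary normal-approximation 95% CI from their most pessimistic point estimate and standard error (my arithmetic, reproducing their reported numbers):

$$
-0.0047 \;\pm\; 1.96 \times 0.0045 \;\approx\; (-0.0135,\; 0.0041) \;\approx\; (-0.014,\; 0.004)
$$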
(Meanwhile, any claims of decreases, or that one should take the harms 'many times over', are undermined by the other parts, like labor income benefiting from fluoridation. Perhaps one should take dental harms more seriously.)
The potential neurotoxic effects of fluoride are no longer a fringe concern. The National Toxicology Program (NTP) monograph is clear: "moderate confidence" that >1.5 mg/L fluoride in drinking water is associated with lower IQ in children.
Their meta-analysis is, as usual for fluoride studies, based heavily on the well-known Chinese studies, and the correlate is much smaller in the low-risk-of-bias studies, also as usual. It doesn't add much. None of these studies are very good, and none use powerful designs like sibling comparisons or natural experiments. They can't be taken too seriously.
The claimed harms of fluoride on IQ are strongly ruled out by the population-registry study "The Effects of Fluoride in Drinking Water", Aggeborn & Öhman 2021, which was published after the cutoff in their literature review.
Dylan seems like a decent enough guy. Why not email him and request free subscriptions for specific email addresses, such as the personal email addresses of key action officers at the redacted Office? (It's worth noting that because proprietary newsletters have zero marginal cost, their operators tend to be a lot more chill about giving away subscriptions, even ones with high face values, than most people let themselves believe, especially if the person receiving the subscription is of any interest.)
BTW, another problem with the thesis "Reward is not the optimization target", even with TurnTrout's stipulation that
This post addresses the model-free policy gradient setting, including algorithms like PPO and REINFORCE.
is that it's still not true even in the model-free policy gradient setting, in any substantive sense, and cannot justify the claims that TurnTrout & Belrose make. That is because of meta-learning: the 'inner' algorithm may in fact be a model-based RL algorithm which was induced by the 'outer' algorithm of PPO/REINFORCE/etc. A model-free algorithm may learn something which optimizes the reward; and a model-based algorithm may also learn something which does not optimize the reward.
There is just no hard-and-fast distinction here; it depends on the details of the system, the environment (ie. the distribution of data), the amount of compute/data, the degree of convergence, and so on. (A good example from an earlier comment on how reward is the optimization target is Bhoopchand et al 2023, which is about ablating the components.)
So if the expressible set of algorithms is rich enough to include model-based RL algorithms, and the conditions are sufficient (enough data, compute, and training), then your PPO algorithm 'which doesn't optimize the reward' simply learns an algorithm which does optimize the reward.
A simple, neat example is given by Botvinick, using NNs like RNNs. (As Shah notes in the comments, this is all considered basic RL and not shocking, and there are many examples of this sort of thing in meta-RL research - although what perspective you take on what is 'outer'/'inner' often depends on which niche you are in, and so Table 1 here may be a helpful Rosetta stone.)
You have a binary choice (like a bandit) which yields a 0/1 reward (perhaps stochastic with probability p to make it interesting) and your NN learns which one; you train a, let's say, fully-connected MLP with REINFORCE, which takes no input and outputs a binary variable to choose an arm; it learns that the left arm yields 1 reward and to always take left. You stop training it, and the environment changes to swap it: the left arm yields 0, and now right yields 1. The MLP will still pick 'left', however, because it learned a policy which doesn't try to optimize the reward. In this case, it is indeed the case that "reward is not the optimization target" of the MLP. It just learned a myopic action which happened to be selected for. In fact, even if you resume training it, it may take a long time to learn to instead pick 'right', because you have to 'undo' all of the now-irrelevant training towards 'left'. And you can do this swapping and training several times, and it'll be about the same each time: the MLP will slowly unlearn the old arm and learn the new arm, then the swapping happens, and now it's gotta do the same thing.*
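A minimal numpy sketch of that setup, under toy assumptions of my own (the no-input MLP collapses to a single learned logit, deterministic 0/1 rewards, arbitrary learning rate and step counts):

```python
# Toy REINFORCE on a 2-armed bandit with a policy that takes no input;
# the "MLP with no input" is collapsed to a single logit parameter.
import numpy as np

rng = np.random.default_rng(0)
lr = 0.1

def p_left(theta):
    return 1 / (1 + np.exp(-theta))

def act(theta):
    return 0 if rng.random() < p_left(theta) else 1   # 0 = left, 1 = right

def reinforce_step(theta, rewarded_arm):
    a = act(theta)
    r = 1.0 if a == rewarded_arm else 0.0
    # grad of log pi(a | theta): (1 - p_left) for 'left', -p_left for 'right'
    grad_logp = (1 - p_left(theta)) if a == 0 else -p_left(theta)
    return theta + lr * r * grad_logp                  # REINFORCE, no baseline

theta = 0.0
for _ in range(2000):                 # train while the LEFT arm pays out
    theta = reinforce_step(theta, rewarded_arm=0)
print("P(left) after training:", p_left(theta))        # ~1.0

# Freeze the policy and swap the environment: RIGHT now pays out.
# The frozen policy keeps picking left and earns ~0 reward - it learned
# a myopic action, not anything that tries to optimize the reward.
picks_left = sum(act(theta) == 0 for _ in range(100))
print("picks left", picks_left, "/ 100 after the swap")
```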
But if you instead train an RNN, giving it as input the history of rewards, and otherwise train in the exact same way... you will instead see something entirely different. After a swap, the RNN will pick the 'wrong' arm a few times, say 5 times, and receive 0 reward - and then abruptly start picking the 'right' arm, even without any further training, just the same frozen RNN weights. This sort of fast response to changing rewards is a signature distinguishing model-free from model-based: if I tell you that I moved your cheese, you can change your policy without ever experiencing a reward, and go to where the cheese is now, without wasting an attempt on the old cheese location; but a mouse can't, or will need at least a few episodes of trial-and-error to update. (Any given agent may use a mix, or hybrids like the 'successor representation', which is sorta both; Sutton is fond of that.) This switch is possible because the RNN has learned a new 'policy' over its history: the sufficient statistics encoded in its hidden state are equivalent to a Bayesian model of the environment, in which it has learned to update its posterior probability that a switch has happened, and that it is utility-maximizing to switch arms after a certain number of failures. And that 5 times was just how much evidence you need to overcome the small prior of 'a switch just happened right now'. And this utility-maximizing inner algorithm is incentivized by the outer algorithm, even though the outer algorithm itself has no concept of an 'environment' to be modeling or a 'reward' anywhere inside it. Your 'reward is not the optimization target' REINFORCE algorithm has learned the Bayesian model-based RL algorithm for which reward is the optimization target, and your algorithm as a whole is now optimizing the reward target, little different from, say, AlphaZero doing MCTS.
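The posterior arithmetic behind 'about 5 failures' can be written out directly; this is a sketch under toy assumptions of mine (reward probabilities of 0.8/0.2 for the good/bad arm and a 1% switch prior, not numbers from the original example), showing that a handful of consecutive failures is what it takes for 'the arms swapped' to become more likely than not:

```python
# Sketch of the Bayesian update the frozen RNN has implicitly learned.
# p_good/p_bad and prior_switch are illustrative numbers of my own.
p_good, p_bad = 0.8, 0.2    # P(reward=1) from the currently-good / currently-bad arm
prior_switch = 0.01         # small prior that the arms were just swapped

def posterior_switch(k_zero_rewards: int) -> float:
    """Posterior P(switch) after k consecutive zero rewards from the arm
    that used to be good, via prior odds times the likelihood ratio."""
    like_switch = (1 - p_bad) ** k_zero_rewards      # that arm is now the bad one
    like_no_switch = (1 - p_good) ** k_zero_rewards  # still the good one, just unlucky
    odds = (prior_switch / (1 - prior_switch)) * (like_switch / like_no_switch)
    return odds / (1 + odds)

k = 0
while posterior_switch(k) <= 0.5:   # switch once a swap is more likely than not
    k += 1
print(f"switch after {k} consecutive failures; posterior = {posterior_switch(k):.2f}")
```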
(And this should not be a surprise, because evolutionary algorithms are often cited as examples of model-free policy gradient algorithms which cannot 'plan' or 'model the environment' or 'optimize the reward', and yet, we humans were created by evolution and we clearly do learn rich models of the environment that we can plan over explicitly to maximize the reward, such as when we play Go and 'want to win the game'. So clearly the inference from 'algorithm X is itself not optimizing the reward' to 'all systems learned by algorithm X do not optimize the reward' is an illicit one.)
And, of course, in the other direction, it is entirely possible and desirable for model-based algorithms to learn model-free ones! (It's meta-learning all the way down.) Planning is expensive; heuristics are cheap and efficient. You often want to distill your expensive model-based algorithm into a cheap model-free algorithm and amortize the cost. In the case of the RNN above, once the model-based algorithm has done its work and solved the switching bandit, you can throw it away and replace it with a much cheaper, simple model-free rule like if sum(reward_history) > 5 then last_action else last_action × −1, saving millions of FLOPs per decision compared to the RNN.
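As a sketch, the distilled rule is just a couple of comparisons (the window size, threshold, and ±1 action coding here are illustrative assumptions of mine):

```python
# The distilled, model-free replacement for the RNN: keep the current arm
# while it has been paying out recently, otherwise flip. Actions are +1/-1.
def distilled_policy(last_action: int, reward_history: list) -> int:
    recent = reward_history[-10:]                  # assumed recent window
    return last_action if sum(recent) > 5 else -last_action
```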
* Further illustrating the weakness of 'reward is not the optimization target': it's not even obvious that this must be the case, rather than merely usually being the case under most setups. Meta-learning doesn't strictly require explicit conditioning on history, nor does it require the clear fast-vs-slow-weight distinction of RNNs or Transformer self-attention. A large enough MLP, continually trained through enough switches, could potentially learn to use the gradient updates + weights themselves as an outsourced history/model, and could eventually optimize its weights into a saddle point where, after exactly k updates by the fixed SGD algorithm, it 'happens to' switch its choice - corresponding to the hidden state being fused into the MLP itself. This would be like MAML. (If you are interested in this vein of thought, you may enjoy my AUNN proposal, which tries to take this to the logical extreme.)
So if planned Microsoft capex was $60bn, that would've been surprising - too little for this project without cutting something else - but $80bn fits this story; that's my takeaway.
But why? You don't know what fiscal year that $25-40bn figure is booked for, and if they are going to run a single true production-scale 3-6-month run (for cost-optimality) on that $40bn cluster, then isn't a total capex of $80bn for all MS datacenters, if anything, surprisingly small? That a single cluster is going to be half their capex, including 2025 spending for future years, like buying land or power or GPUs?
(Also, note that this $80bn figure is intrinsically untrustworthy, because, as I was pointing out, the importance of this is the political signaling going on, and so you would expect this number to be 'technically correct' - highly manipulated in some direction which does in fact yield a number starting with '80' but only loosely corresponding to reality. This number is propaganda, and good propaganda is 'technically correct' but not necessarily true. My best guess is that it's probably being manipulated to be as high as possible, but I'm not sure, because so many of the dynamics here are opaque; it could also be manipulated to be low.)
Musk's 100K H100s Colossus tells me that building a training system in a year is feasible, even though it normally takes longer.
Which implies that they would need to be spending that $40bn on the cluster in 2024 if they want to run it in 2025, and so it shouldn't be part of the 2025 estimate... If you really want to put stress on this, it contradicts your story about why $80bn is evidence for that. Also, note that Musk's success there is dubious: he got there by doing things like hooking up temporary natural-gas generators and diverting GPUs from Tesla, and it's unclear how well it even works, given the rumors of a big training-run failure and the rather precise wording of Musk's tweets about what exactly the datacenter can do.
But he's not complaining about the traditional pages of search results!
He is definitely complaining about it in general. He has many complaints laced throughout which are not solely about the infobox, and which show his general opposition to the very idea of a search engine, eg.
But the self-appointed custodians of the world’s knowledge can’t cope with that tiny irregularity in the data, so they insist on filling the gap with whatever comes to hand:
Yes! That's the idea! Showing whatever comes to hand!
The photo is gone again, probably because I managed to get it taken down from the Russian site a few days ago. But the underlying problem remains: Google’s software has no ability to distinguish reliable assertions about the real world from random nonsense that appears on the web, created by incompetent or malicious third parties.
The 'underlying problem' is the problem, even when the thing that, according to you, is the problem has been fixed.
For the people being falsely portrayed as “Australian science fiction writer Greg Egan”, this is probably just a minor nuisance, but it provides an illustration of how laughable the notion is that Google will ever be capable of using its relentlessly over-hyped “AI” to make sense of information on the web.
"Make sense of information on the web" obviously goes far beyond complaints about merely a little infobox being wrong.
This seems to have helped, slightly, but only in the sense that photos that shouldn’t be included here at all no longer come first in line. The current clumsy mash-up is shown in the screen shot on the left: a few copies of the decoy images that I put on my site in the hope of letting humans know that there are no actual photos of me on the web, and a couple of my book covers as well
"Decoy images"!
And so on and so forth, like the 2016 entry, which is a thousand words criticizing Google for what it supplies - not in the infobox - about a bunch of other, actual, Greg Egans.
Again, Egan is being quite clear that he means the crazy thing you insist he can't mean. And this is what he is talking about when he complains, "And by displaying results from disparate sources in a manner that implies that they refer to the same subject, it acts as a mindless stupidity amplifier that disseminates and entrenches existing errors" - he thinks displaying them at all is the problem. It shouldn't be amplifying or disseminating 'existing errors', even though he is demanding something impossible, something which, if it were possible, would remove a lot of a search engine's value. (I often am investigating 'existing errors'...)
if you're a specialist that already knows what you're doing, but non-specialists just reach for the first duct-tape solution that comes to mind without noticing how bad it is.
I was an even worse programmer and web developer than Egan (see eg his mathematics pages) when, ~2009, I solved the same problem in minutes as part of basic DNS setup. Imagine: I didn't even realize back then that I should be so impressed at how I pulled off something only a 'specialist' could!
I agree that preëmptive blocking is kind of weird, but I also think your locked account with "Follow requests ignored due to terrible UI" is kind of weird.
The blocking, whenever exactly it was, happened years and years before I ever locked my account, which was relatively recent and just due to Elon Musk following me. (It would be even weirder if he had blocked me afterwards, as there is even less point to preemptively blocking a locked account.)
You might be interested in an earlier discussion on whether "humans are a hot mess": https://www.lesswrong.com/posts/SQfcNuzPWscEj4X5E/the-hot-mess-theory-of-ai-misalignment-more-intelligent https://www.lesswrong.com/posts/izSwxS4p53JgJpEZa/notes-on-the-hot-mess-theory-of-ai-misalignment