This is such a bizarre reply. Part of the time-honored ideal of being widely read (that I didn't think I needed to explicitly spell out) is that you're not supposed to believe everything you read.
Right? I don't think this is special "rationalist" wisdom. I think this is, like, liberal arts. Like, when 11th grade English teachers assign their students to read Huckleberry Finn, the idea is that being able to see the world through the ungrounded lenses of 19th-century racists makes them more sane, because they can contrast the view through those particular ungrounded lenses with everything else they've read.
many sources of news have been adversarially pursued and optimized to make the consumers of it be corrupted and controlled.
I mean, yes, but the way they pull that off is by convincing the consumers that they shouldn't want to read any of those awful corners of the internet where they do truthseeking far worse than here. (Pravda means "truth"; Donald Trump's platform is called Truth Social.)
some environments have valuable info [...] but I would say that most environments talking about "current events" in government/politics are not.
Given finite reading time, you definitely need to prioritize ruthlessly to manage the signal-to-noise ratio. If you don't have time to read anything but Mowshowitz, that's fine; most things aren't worth your time. But if you're skeptical that a human can expose itself to any social environment on the internet and do better, that doesn't sound like a signal-to-noise ratio concern. That sounds like a contamination concern.
A more concise term for "follows discourse in crappy filter-bubbles" is "widely read". If you want to live inside a Zvi Mowshowitz filter bubble because Mowshowitz offers a good signal-to-noise ratio, that makes sense if you're super-busy and don't have much time to read, but that should be a mere time-saving optimization on your part. If you actually think that non-ingroup information sources are "awful" and "crappy" because "[t]hey do truth-seeking far worse there", then you probably could stand to read more widely!
Thanks; I edited the link (on this Less Wrong mirrorpost).
encouraging infinite feuds
"Feuds", is that really what people think? (I think it's fine for people to criticize me, and that it's fine for me to reply.) I'm really surprised at the contrast between the karma and the comment section on this one—currently 10 karma in 26 votes (0.38 karma/vote). Usually when I score that poorly, it's because I really messed up on substance, and there's a high-karma showstopper comment explaining what I got so wrong, but none of the comments here seem like showstoppers.
Do you have the same objection to the post I'm responding to getting Frontpaged (and in fact, Curated)?
To be clear, I think it was obviously correct for "Truth or Dare" to be Frontpaged (it was definitely relevant and timeless, even if I disagree with it); I'm saying I don't think it's consistent for a direct response to a Frontpage (Curated!) post to somehow not qualify for Frontpage.
Supporters counter that Trump's actions are either completely precedented, or
Um, I thought the selling point of Trump was precisely that the institutions of the permanent education-media-administrative state are corrupt, and that Trump is going to fight them. Claims that Trump II is business-as-usual are probably political maneuvering that should not be taken literally. (They know Trump isn't business-as-usual, but they don't want to say that part out loud, because making it common knowledge would disadvantage their side in the war against the institutions they're trying to erode.)
That seemed ... like it was approaching a methodology that might actually be cruxy for some Trump supporters or Trump-neutral-ers.
No? The pretense that media coverage is "neutral" rather than being the propaganda arm of the permanent education-media-administrative state is exactly what's at issue.
You said, "I don't see how [not mentioning inductive biases] makes Duncan's summary either untrue or misleading, because eliding it doesn't change (1) [we choose "teal shape" data to grow the "black shape" AI] or (4) [we don't get the AI we want]." But the point of the broken syllogism in the grandparent is that it's not enough for the premise to be true and the conclusion to be true; the conclusion has to follow from the premise.
The context of the teal/black shape analogy in the article is an explanation of how "modern AIs aren't really designed so much as grown or evolved" with the putative consequence that "there are many, many, many different complex architectures that are consistent with behaving 'properly' in the training environment, and most of them don't resemble the thing the programmers had in mind".
Set aside the question of superintelligence for the moment. Is this true as a description of "modern AIs", e.g., image classifiers? That's not actually clear to me.
It is true that adversarially robust image classification isn't a solved problem, despite efforts: it's usually possible (using the same kind of gradient-based optimization used to train the classifiers themselves) to successfully search for "adversarial examples" that machines classify differently than humans, which isn't what the programmers had in mind.
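(To make "gradient-based search for adversarial examples" concrete, here's a minimal sketch of the single-step version, in the style of Goodfellow et al.'s fast gradient sign method. `model`, `x`, `true_label`, and the perturbation budget `epsilon` are all stand-ins, not any particular paper's setup.)

```python
import torch
import torch.nn.functional as F

def fgsm_adversarial_example(model, x, true_label, epsilon=0.03):
    """One-step gradient-sign attack: nudge the input in whatever direction
    most increases the classifier's loss on the true label, within an
    L-infinity ball of radius epsilon."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), true_label)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()  # same gradient machinery used for training
    return x_adv.clamp(0, 1).detach()
```

The point being that the attacker is using the same optimization machinery as the trainer, just pointed at the input instead of the weights.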
But Ilyas et al. 2019 famously showed that adversarial examples are often due to "non-robust" features that are doing predictive work, but which are counterintuitive to humans. That would be an example of our data pointing at, as you say, an "underlying simplicity of which we are unaware".
I'm saying that's a different problem than a counting argument over putative "many, many, many different complex architectures that are consistent with behaving 'properly' in the training environment", which is what the black/teal shape analogy seems to be getting at. (There are many, many, many different parametrizations that are consistent with behaving properly in training, but I'm claiming that the singular learning theory story explains why that might not be a problem, if they all compute similar functions.)
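(The cheapest illustration of why counting parametrizations overcounts: permuting the hidden units of a network gives you a genuinely different point in parameter space that computes exactly the same function. A toy sketch with arbitrary layer sizes, not the SLT story itself:)

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)   # input -> hidden
W2, b2 = rng.normal(size=(4, 16)), rng.normal(size=4)    # hidden -> output

def mlp(x):
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2           # small ReLU network

def mlp_permuted(x, perm):
    # Reorder the hidden units (rows of W1, entries of b1, columns of W2).
    return W2[:, perm] @ np.maximum(W1[perm] @ x + b1[perm], 0) + b2

x, perm = rng.normal(size=8), rng.permutation(16)
assert np.allclose(mlp(x), mlp_permuted(x, perm))          # different weights, same function
```

SLT's contribution (as I understand it) is a much more general accounting of these degeneracies, but even the toy version shows that "many parameter settings" doesn't automatically mean "many behaviorally distinct models".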
Seems like evidence that the HSTS vs. just-HS difference is dimensional (and maybe culturally determined) rather than taxonic, which supports the "brain sex" gloss in some ways, but not others (gay men are still men for a lot of clinical and policy purposes)?
(Self-review.) I started this series to explore my doubts about the "orthodox" case for alignment pessimism. I wrote it as a dialogue and gave my relative non-pessimist character the designated idiot character name to make it clear that I'm just exploring ideas and not staking my reputation on "heresy". ("Maybe alignment isn't that hard" doesn't sound like a smart person's position—and in fact definitely isn't a smart person's position for sufficiently ambitious conceptions of what it would mean to "solve the alignment problem." Simplicia isn't saying, "Oh, yeah, we're totally on track to solve philosophy forever in machine-codable form suitable for specifying the values of the superintelligence at the end of time". As will be explored in part four—forthcoming March 2026—perhaps the disagreement is really about whether some less ambitious alignment target might salvage some cosmic value.)
That said, more than the other entries in this series, this is the one where I'm willing to cop to and put my weight down on Simplicia representing my own views, rather than laundering my doubts as just asking questions.
I understand and agree that there's a useful analogy between stochastic gradient descent and natural selection, and between future AGI misalignment and humans valuing sex and sweets rather than fitness. To someone who's never thought about these topics at all, dwelling on the analogy at length is indeed a good use of time. But it's frustrating how much MIRI's recent messaging just makes the analogy and then stops there, without considering the huge important disanalogies, like how (as Paul Christiano pointed out in 2022) selective breeding kind-of works and is a better analogical fit to AI (there wasn't an Evolution Fairy that was trying to make fitness-maximizers; an alien agency trying to selectively breed humans from the EEA would have been able to test hypotheses about how smarter humans would generalize, rather than being taken by surprise by modernity the way an Evolution Fairy would have been), and how deep learning is better thought of as program synthesis rather than evolving a little animal.
Maybe that's instrumentally rational insofar as MIRI is a propaganda outlet now (in the literal meaning of the word, "public communication aimed at influencing an audience and furthering an agenda") and doesn't seem to care that much about being intellectually credible in ways that don't cash out as policy influence? (It looks like Redwood Research may have picked up the torch.) But it's disappointing.
Let's set the clinical and policy implications aside for a moment. You said "I don't think there is an experiment we can run to determine which is true", and I'm saying that the theories make different predictions: for example, ETLE has no problem explaining why so many trans women are lesbians (that's exactly what you would expect if most trans women are paraphilic males), whereas brain sex theories have a harder time.
Evidence for other putative ETLEs like furries or apotemnophilia makes it more plausible that ETLE is what's going on with most gynephilic trans women. (Why would these groups look so much alike along so many dimensions, but have completely different etiologies?)
Sorry, I don't want to accidentally overemphasize SLT in particular, which I am not an expert in. I think what's at issue is how predictable deep learning generalization is: what kind of knowledge would be necessary in order to "get what you train for"?
This isn't obvious from first principles. Given a description of SGD and the empirical knowledge of 2006, you could imagine it going either way. Maybe we live in a "regular" computational universe, where the AI you get depends on your architecture and training data according to learnable principles that can be studied by the usual methods of science in advance of the first critical try, but maybe it's a "chaotic" universe where you can get wildly different outcomes depending on the exact path taken by SGD.
A lot of MIRI's messaging, such as the black shape metaphor, seems to assume that we live in a chaotic universe, as when Chapter 4 of If Anyone Builds It claims that the preferences of powerful AI "might be chaotic enough that if you tried it twice, you'd get different results each time." But I think that if you've been paying attention to the literature about the technology we're discussing, there's actually a lot of striking empirical evidence that deep learning is much more "regular" than someone might have guessed in 2006: things like how Draxler et al. 2018 showed that you can find continuous low-loss paths between the results of different training runs (rather than being in different basins which might have wildly different generalization properties), or how Moschella et al. 2022 found that different models trained on different data end up learning the same latent space (such that representations by one can be reused by another without extra training). Those are empirical results; the relevance of SLT is as a theoretical insight as to how these results are even possible, in contrast to how people in 2006 might have had the intuition, "Well, 'stochastic' is right there as the 'S' in SGD, of course the outcome is going to be unpredictable."
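(For concreteness, here's the crudest version of the kind of measurement Draxler et al. are doing: evaluate the loss along a straight line between two independently trained parameter vectors. The interesting finding is that while the straight line typically hits a high-loss barrier, there exist nonlinear paths along which the loss stays low. `model_a`, `model_b`, and `eval_loss` are placeholders for two same-architecture trained networks and a held-out-data loss function.)

```python
import copy
import torch

def loss_along_linear_path(model_a, model_b, eval_loss, steps=11):
    """Loss at evenly spaced points on the straight line (in weight space)
    between two trained models with identical architectures."""
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        interpolated = copy.deepcopy(model_a)
        with torch.no_grad():
            for p_i, p_a, p_b in zip(interpolated.parameters(),
                                     model_a.parameters(),
                                     model_b.parameters()):
                p_i.copy_((1 - alpha) * p_a + alpha * p_b)
        losses.append(eval_loss(interpolated))
    return losses
```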
I think it's worth being really specific about what kind of "AI" you have in mind when you make this kind of claim. You might think, "Well, obviously I'm talking about superintelligence; this is a comment thread about a book about why people shouldn't build superintelligence."
The problem is that if you try to persuade people not to build superintelligence using arguments that seem to apply just as well to the kind of AI we have today, you're not going to be very convincing when people talk to human-compatible AIs behaving pretty much the way their creators intended all the time, every day.
That's what I'm focused on in this thread: the arguments, not the conclusion. (This methodology is probably super counterintuitive to a lot of people, but it's part of this website's core canon.) I'm definitely not saying anyone knows how to train the "entire spectrum of reflectively consistent human values". That's philosophy, which is hard. I'm thinking about a much narrower question of computer science.
Namely: if I take the black shape metaphor or Chapter 4 of If Anyone Builds It at face value, it's pretty confusing how RLAIF approaches like constitutional AI can work at all. Not just when hypothetically scaled to superintelligence. I mean, at all. Upthread, I wrote about how people customize base language models by upweighting trajectories chosen by a model trained to predict human approval and disapproval ratings.
In RLAIF, they use an LLM itself to provide the ratings instead of any actual humans. If you only read MIRI's propaganda (in its literal meaning, "public communication aimed at influencing an audience and furthering an agenda") and don't read arXiv, that just sounds suicidal.
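(Schematically, the labeling step looks something like the sketch below, as I understand Bai et al. 2022: an LLM prompted with a "constitution" picks the better of two candidate responses, and the resulting preference pairs are used exactly like human preference data, to fit a reward model that the policy is then optimized against. `judge_llm` and the prompt format are stand-ins, not any lab's actual pipeline.)

```python
CONSTITUTION = "Choose the response that is more helpful, honest, and harmless."

def ai_preference_label(judge_llm, prompt, response_a, response_b):
    """Ask the judge model which of two candidate responses better satisfies
    the constitution; return a preference pair for reward-model training."""
    query = (
        f"{CONSTITUTION}\n\n"
        f"Prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        f"Answer with 'A' or 'B'."
    )
    choice = judge_llm(query).strip()
    chosen, rejected = (response_a, response_b) if choice == "A" else (response_b, response_a)
    # These {chosen, rejected} pairs then play the role the human ratings
    # played upthread: train a preference model on them and upweight policy
    # trajectories it scores highly.
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```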
But it's working! (For now.) It's working better than the version with actual human preference rankings! Why? How? Prosaic alignment optimists would say: it learned the intended Platonic representation from pretraining. Are they wrong? Maybe! (I'm still worried about what happens if you optimize too hard against the learned representation.)
But in order to convince policymakers that the prosaic alignment optimists are wrong (while the prosaic alignment optimists are passing them bars of AI-printed gold under the table), you're going to need a stronger argument than "the black shape is still basically unpredictable from the perspective of the teal-shape drawer". If it were actually unpredictable, where is all this gold coming from?
While we're still in the regime of pretraining on largely human-generated data, this is arguably great for alignment. You don't have to understand the complex structure of human value; you can just point SGD at valuable data and get a network that spits out "more from that distribution", without any risk of accidentally leaving out boredom and destroying all value that way.
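(Concretely, "point SGD at valuable data" is just next-token prediction: the only specification the trainer supplies is the data itself. A minimal sketch, assuming `model` maps token ids to next-token logits; nothing here is any particular lab's training code.)

```python
import torch
import torch.nn.functional as F

def pretraining_step(model, optimizer, token_batch):
    """One SGD step of next-token prediction on a (batch, seq_len) tensor of
    token ids. No hand-written description of 'value' appears anywhere; the
    target distribution is specified entirely by the data."""
    inputs, targets = token_batch[:, :-1], token_batch[:, 1:]
    logits = model(inputs)                                  # (batch, seq_len - 1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```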
Obviously, that doesn't mean the humans are out of the woods. As the story of Earth-originating intelligent life goes on, and the capabilities of Society's cutting-edge AIs start coming more and more from reinforcement learning and less and less from pretraining, you start to run a higher risk of misspecifying your rewards, eventually fatally. But that world looks a lot more like Christiano's "you get what you measure" scenario than like Part II of If Anyone Builds It, even if the humans are dead at the end of both stories. Still, the details matter for deciding which interventions are most dignified—possibly even if you think governance is more promising than alignment research. (Which specific regulations you want in your Pause treaty depends on which AI techniques are feasible and which ones are dangerous.)
Yes, the word choice of "architectures" in the phrase "many, many, many different complex architectures" in the article is puzzling. I don't know what the author meant by that word, but to modern AI practitioners, "architecture" is the part of the system that is designed rather than "grown": these-and-such many layers with such-and-these activation functions—the matrices, not the numbers inside them.