I am not experiencing suffering or claiming to experience suffering [...] I find this a psychologically invasive and offensive suggestion on your part
Sorry, I should have been clearer: I was trying to point to the game-theoretic structure where, as your "madman with hostages" metaphor suggests, an author considering publishing an allegedly suffering-causing idea could be construed as engaging in extortion (threatening to cause suffering by publishing, and demanding concessions in exchange for not publishing), but where, at the same time, someone appealing to suffering as a reason not to publish could also be construed as engaging in extortion (threatening that publication will cause suffering, and demanding concessions, like extra research and more careful wording, as the price of publication). I think this is an interesting game-theoretic consideration that's relevant to the topic of discussion; it's not necessarily about you.
In cases where convincing is >>> costly to complying to the request it's good form to comply
How do we know you're not bluffing? (Sorry, I know that's a provocative-sounding question, but I think it's actually a question that you need to answer in order to invoke costly signaling theory, as I explain below.)
Your costly signaling theory seems to be that by writing passionately, you can distinguish yourself as seeing a real danger that you can't afford to demonstrate, rather than just trying to silence an idea you don't like despite a lack of real danger.
But costly signaling only works when false messages are more expensive to send, and that doesn't seem to be the case here. Someone who did want to silence an idea they didn't like despite a lack of real danger could just as easily write as passionately as you.
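To spell out the condition I have in mind (a toy formalization on my part, not something you wrote): let $b$ be the benefit of having your message believed, $c_H$ the cost of writing that passionately for someone who genuinely sees a danger they can't afford to demonstrate, and $c_D$ the cost for someone who just wants to silence an idea they dislike. The signal separates the two types only if $c_H \le b < c_D$, so that the honest sender finds it worth paying and the dishonest sender doesn't. If passion costs both types about the same, then $c_H \approx c_D$, and we're in a pooling equilibrium where the signal conveys no information.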
and you are making it my responsibility to convince you to read the brochure
I mean, yes? If you want someone to do something that they wouldn't otherwise do, you need to persuade them. How could it be otherwise?
From my perspective, you are a madman with hostages and a loaded gun!
But this goes both ways, right? What counts as extortion depends on what the relevant property rights are. If readers have a right to not suffer, then authors who propose exploring suffering-causing ideas are threatening them; but if authors have a right to explore ideas, then readers who propose not exploring suffering-causing ideas are threatening them.
Interestingly, this dynamic is a central example of the very phenomenon Morphism is investigating! Someone who wants to censor an idea has a game-theoretic incentive to self-modify to suffer in response to expressions of the idea, in order to extort people who care about their suffering into not expressing the idea.
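As a toy model (my notation, not Morphism's): suppose expressing the idea is worth $v > 0$ to the author, the reader suffers $s$ when it's expressed, and the author weighs the reader's suffering at $w > 0$. The author publishes only if $v > ws$, so a reader who can self-modify to suffer any amount $s > v/w$ deters publication, and in equilibrium the self-modification costs them nothing, since the threatened suffering never has to be realized. That's exactly the commitment structure that makes the extortion framing apt.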
Sorry, I don't want to accidentally overemphasize SLT in particular, which I am not an expert in. I think what's at issue is how predictable deep learning generalization is: what kind of knowledge would be necessary in order to "get what you train for"?
This isn't obvious from first principles. Given a description of SGD and the empirical knowledge of 2006, you could imagine it going either way. Maybe we live in a "regular" computational universe, where the AI you get depends on your architecture and training data according to learnable principles that can be studied by the usual methods of science in advance of the first critical try, but maybe it's a "chaotic" universe where you can get wildly different outcomes depending on the exact path taken by SGD.
A lot of MIRI's messaging, such as the black shape metaphor, seems to assume that we live in a chaotic universe, as when Chapter 4 of If Anyone Builds It claims that the preferences of powerful AI "might be chaotic enough that if you tried it twice, you'd get different results each time." But I think that if you've been paying attention to the literature about the technology we're discussing, there's actually a lot of striking empirical evidence that deep learning is much more "regular" than someone might have guessed in 2006: things like how Draxler et al. 2018 showed that you can find continuous low-loss paths between the results of different training runs (rather than ending up in different basins which might have wildly different generalization properties), or how Moschella et al. 2022 found that different models trained on different data end up learning the same latent space (such that representations by one can be reused by another without extra training). Those are empirical results; the relevance of SLT is as a theoretical insight into how these results are even possible, in contrast to how people in 2006 might have had the intuition, "Well, 'stochastic' is right there as the 'S' in SGD, of course the outcome is going to be unpredictable."
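For concreteness, here's roughly the kind of check that result is about (a minimal PyTorch sketch of my own: straight-line interpolation between two checkpoints, which is cruder than the curved low-loss paths Draxler et al. actually search for; `eval_loss` is a stand-in for whatever validation loss you care about):

```python
import copy
import torch

def interpolate_loss(model_a, model_b, eval_loss, steps=11):
    """Evaluate loss along the straight line between two trained models' weights.

    Finding no high-loss barrier along such a path (Draxler et al. 2018 search
    over curved paths, which is strictly more powerful) is evidence that
    independent training runs land in connected regions rather than isolated
    basins with wildly different generalization behavior.
    """
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    probe = copy.deepcopy(model_a)
    losses = []
    for i in range(steps):
        alpha = i / (steps - 1)
        blended = {
            k: torch.lerp(v, state_b[k], alpha) if v.is_floating_point() else v
            for k, v in state_a.items()
        }
        probe.load_state_dict(blended)
        losses.append(eval_loss(probe))  # eval_loss: stand-in for your validation loss
    return losses
```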
it does seem to me like an AI trained using modern methods, e.g. constitutional AI, is insufficiently constrained to embody human-compatible values [...] the black shape is still basically unpredictable from the perspective of the teal-shape drawer
I think it's worth being really specific about what kind of "AI" you have in mind when you make this kind of claim. You might think, "Well, obviously I'm talking about superintelligence; this is a comment thread about a book about why people shouldn't build superintelligence."
The problem is that if you try to persuade people not to build superintelligence using arguments that seem to apply just as well to the kind of AI we have today, you're not going to be very convincing when people talk to human-compatible AIs that behave pretty much the way their creators intended, all day, every day.
That's what I'm focused on in this thread: the arguments, not the conclusion. (This methodology is probably super counterintuitive to a lot of people, but it's part of this website's core canon.) I'm definitely not saying anyone knows how to train the "entire spectrum of reflectively consistent human values". That's philosophy, which is hard. I'm thinking about a much narrower question of computer science.
Namely: if I take the black shape metaphor or Chapter 4 of If Anyone Builds It at face value, it's pretty confusing how RLAIF approaches like constitutional AI can work at all. Not just when hypothetically scaled to superintelligence. I mean, at all. Upthread, I wrote about how people customize base language models by upweighting trajectories chosen by a model trained to predict human approval and disapproval ratings.
In RLAIF, they use an LLM itself to provide the ratings instead of any actual humans. If you only read MIRI's propaganda (in its literal meaning, "public communication aimed at influencing an audience and furthering an agenda") and don't read arXiv, that just sounds suicidal.
But it's working! (For now.) It's working better than the version with actual human preference rankings! Why? How? Prosaic alignment optimists would say: it learned the intended Platonic representation from pretraining. Are they wrong? Maybe! (I'm still worried about what happens if you optimize too hard against the learned representation.)
But in order to convince policymakers that the prosaic alignment optimists are wrong (while the prosaic alignment optimists are passing them bars of AI-printed gold under the table), you're going to need a stronger argument than "the black shape is still basically unpredictable from the perspective of the teal-shape drawer". If it were actually unpredictable, where is all this gold coming from?
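For readers who haven't seen it spelled out, here's a schematic of the pipeline I'm describing (my own stand-in names like `policy`, `judge_llm`, and `reward_model`, with made-up methods rather than any real library's API; real systems use PPO or similar rather than this naive best-of-two update):

```python
# Schematic RLAIF loop. Every object here (policy, judge_llm, reward_model) is a
# hypothetical stand-in, not a real library's API.

def rlaif_step(policy, judge_llm, reward_model, prompts, constitution):
    # 1. Sample pairs of candidate responses from the current policy.
    pairs = [(p, policy.sample(p), policy.sample(p)) for p in prompts]

    # 2. Instead of human raters, ask an LLM which response better follows the
    #    written constitution -- this is the "AI feedback" in RLAIF.
    labels = [judge_llm.prefer(prompt, a, b, constitution) for prompt, a, b in pairs]

    # 3. Update a reward model to predict those preference labels.
    reward_model.update(pairs, labels)

    # 4. Upweight the trajectories the reward model likes (here, a naive
    #    best-of-two reinforcement; real implementations use PPO or similar).
    for prompt, a, b in pairs:
        best = max((a, b), key=lambda r: reward_model.score(prompt, r))
        policy.reinforce(prompt, best)
```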
The black thing conforms to all the regularities. This does not by coincidence happen to cause it to occupy the shape you hoped for; you do not see all the non-robust / illegible features constraining that shape
While we're still in the regime of pretraining on largely human-generated data, this is arguably great for alignment. You don't have to understand the complex structure of human value; you can just point SGD at valuable data and get a network that spits out "more from that distribution", without any risk of accidentally leaving out boredom and destroying all value that way.
Obviously, that doesn't mean the humans are out of the woods. As the story of Earth-originating intelligent life goes on, and the capabilities of Society's cutting-edge AIs start coming more and more from reinforcement learning and less and less from pretraining, you start to run a higher risk of misspecifying your rewards, eventually fatally. But that world looks a lot more like Christiano's "you get what you measure" scenario than like Part II of If Anyone Builds It, even if the humans are dead at the end of both stories. And the details matter for deciding which interventions are most dignified—possibly even if you think governance is more promising than alignment research. (Which specific regulations you want in your Pause treaty depends on which AI techniques are feasible and which ones are dangerous.)
the summary is incorrect to analogize the black thing to "architectures" instead of "parametrizations" or "functions"
Yes, the word choice of "architectures" in the phrase "many, many, many different complex architectures" in the article is puzzling. I don't know what the author meant by that word, but to modern AI practitioners, "architecture" is the part of the system that is designed rather than "grown": these-and-such many layers with such-and-these activation functions—the matrices, not the numbers inside them.
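To make the distinction concrete (a minimal PyTorch illustration of my own, not anything from the article):

```python
import torch.nn as nn

# The *architecture* is designed: which layers, what sizes, which activations.
model = nn.Sequential(
    nn.Linear(784, 128),  # the matrices exist by fiat, with these shapes
    nn.ReLU(),
    nn.Linear(128, 10),
)

# The *parametrization* is grown: the particular numbers SGD puts inside those
# matrices. Two training runs share the architecture above but generally end up
# with different values in model.state_dict().
params = model.state_dict()
```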
This is such a bizarre reply. Part of the time-honored ideal of being widely read (that I didn't think I needed to explicitly spell out) is that you're not supposed to believe everything you read.
Right? I don't think this is special "rationalist" wisdom. I think this is, like, liberal arts. Like, when 11th grade English teachers assign their students to read Huckleberry Finn, the idea is that being able to see the world through the ungrounded lenses of 19th-century racists makes them more sane, because they can contrast the view through those particular ungrounded lenses with everything else they've read.
many sources of news have been adversarially pursued and optimized to make the consumers of it be corrupted and controlled.
I mean, yes, but the way they pull that off is by convincing the consumers that they shouldn't want to read any of those awful corners of the internet where they do truthseeking far worse than here. (Pravda means "truth"; Donald Trump's platform is called Truth Social.)
some environments have valuable info [...] but I would say that most environments talking about "current events" in government/politics are not.
Given finite reading time, you definitely need to prioritize ruthlessly to manage the signal-to-noise ratio. If you don't have time to read anything but Mowshowitz, that's fine; most things aren't worth your time. But if you're skeptical that a human can expose itself to any social environment on the internet and do better, that doesn't sound like a signal-to-noise ratio concern. That sounds like a contamination concern.
A more concise term for "follows discourse in crappy filter-bubbles" is "widely read". If you want to live inside a Zvi Mowshowitz filter bubble because Mowshowitz offers a good signal-to-noise ratio, that makes sense if you're super-busy and don't have much time to read, but that should be a mere time-saving optimization on your part. If you actually think that non-ingroup information sources are "awful" and "crappy" because "[t]hey do truth-seeking far worse there", then you probably could stand to read more widely!
Thanks; I edited the link (on this Less Wrong mirrorpost).
encouraging infinite feuds
"Feuds", is that really what people think? (I think it's fine for people to criticize me, and that it's fine for me to reply.) I'm really surprised at the contrast between the karma and the comment section on this one—currently 10 karma in 26 votes (0.38 karma/vote). Usually when I score that poorly, it's because I really messed up on substance, and there's a high-karma showstopper comment explaining what I got so wrong, but none of the comments here seem like showstoppers.
Do you have the same objection to the post I'm responding to getting Frontpaged (and in fact, Curated)?
To be clear, I think it was obviously correct for "Truth or Dare" to be Frontpaged (it was definitely relevant and timeless, even if I disagree with it); I'm saying I don't think it's consistent for a direct response to a Frontpage (Curated!) post to somehow not qualify for Frontpage.
Supporters counter that Trump's actions are either completely precedented, or
Um, I thought the selling point of Trump was precisely that the institutions of the permanent education-media-administrative state are corrupt, and that Trump is going to fight them. Claims that Trump II is business-as-usual are probably political maneuvering that should not be taken literally. (They know Trump isn't business-as-usual, but they don't want to say that part out loud, because making it common knowledge would disadvantage their side in the war against the institutions they're trying to erode.)
That seemed ... like it was approaching a methodology that might actually be cruxy for some Trump supporters or Trump-neutral-ers.
No? The pretense that media coverage is "neutral" rather than being the propaganda arm of the permanent education-media-administrative state is exactly what's at issue.
You said, "I don't see how [not mentioning inductive biases] makes Duncan's summary either untrue or misleading, because eliding it doesn't change (1) [we choose "teal shape" data to grow the "black shape" AI] or (4) [we don't get the AI we want]." But the point of the broken syllogism in the grandparent is that it's not enough for the premise to be true and the conclusion to be true; the conclusion has to follow from the premise.
The context of the teal/black shape analogy in the article is an explanation of how "modern AIs aren't really designed so much as grown or evolved" with the putative consequence that "there are many, many, many different complex architectures that are consistent with behaving 'properly' in the training environment, and most of them don't resemble the thing the programmers had in mind".
Set aside the question of superintelligence for the moment. Is this true as a description of "modern AIs", e.g., image classifiers? That's not actually clear to me.
It is true that adversarially robust image classification isn't a solved problem, despite efforts: it's usually possible (using the same kind of gradient-based optimization used to train the classifiers themselves) to successfully search for "adversarial examples" that machines classify differently than humans, which isn't what the programmers had in mind.
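The "same kind of gradient-based optimization" point is meant literally; here's a minimal sketch of the classic one-step attack (FGSM, in the style of Goodfellow et al. 2014; `model`, `image`, and `label` are whatever classifier and batch you have on hand):

```python
import torch
import torch.nn.functional as F

def fgsm_adversarial_example(model, image, label, epsilon=0.03):
    """One-step gradient attack: nudge the input in the direction that increases
    the classifier's loss, just as training nudges the weights in the direction
    that decreases it."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Perturb the *input* along the sign of its gradient; the result often looks
    # unchanged to a human but gets classified differently by the model.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()
```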
But Ilyas et al. 2019 famously showed that adversarial examples are often due to "non-robust" features that are doing predictive work, but which are counterintuitive to humans. That would be an example of our data pointing at, as you say, an "underlying simplicity of which we are unaware".
I'm saying that's a different problem than a counting argument over putative "many, many, many different complex architectures that are consistent with behaving 'properly' in the training environment", which is what the black/teal shape analogy seems to be getting at. (There are many, many, many different parametrizations that are consistent with behaving properly in training, but I'm claiming that the singular learning theory story explains why that might not be a problem, if they all compute similar functions.)
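A micro-example of "different parametrizations, same function" (my own toy code, illustrating hidden-unit permutation symmetry; the SLT story is about much richer degeneracies than this, but this is the easiest one to see):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))

# Permuting the hidden units gives a different point in parameter space...
perm = torch.randperm(8)
permuted = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
with torch.no_grad():
    permuted[0].weight.copy_(net[0].weight[perm])
    permuted[0].bias.copy_(net[0].bias[perm])
    permuted[2].weight.copy_(net[2].weight[:, perm])
    permuted[2].bias.copy_(net[2].bias)

# ...that computes exactly the same function.
x = torch.randn(5, 4)
assert torch.allclose(net(x), permuted(x), atol=1e-6)
```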
When someone uses the phrase "costly signal", I think it's germane and not an isolated demand for rigor to point out that the standard academic meaning of the term includes the requirement that honest actors have an easier time paying the cost than dishonest actors.
That is: I'm not saying you were bluffing; I'm saying that, logically, if you're going to claim that costly signals make your claim trustworthy (which is how I interpreted your remarks about "a method of rendering a more costly signal"; my apologies if I misread that), you should have some sort of story for why a dishonest actor couldn't send the same signal. I think this is a substantive technical point; the possibility of being stuck in a pooling equilibrium with other agents who could send the same signals as you for different reasons is definitely frustrating, but not talking about it doesn't make the situation go away.
I agree that you're free to ignore my comments. It's a busy, busy world that may not last much longer; it makes sense for people to have better things to do with their lives than respond to every blog comment making a technical point about game theory. In general, I hope for my comments to provide elucidation to third parties reading the thread, not just the person I'm replying to, so when an author has a policy of ignoring me, that doesn't necessarily make responding to their claims on a public forum a waste of my time.