I mostly agree, and I'm glad to see the point written up here. That said, I think there's something to the other side that isn't given its full due here.
(If you independently got everything exactly right the first time, then there would be nothing for critics to do; it's just that that seems pretty unlikely if you're talking about anything remotely complicated. It would be hard to believe that such an unlikely-seeming thing had really happened without the toughest critics getting the chance to do their worst.)
Of course there'd be stuff for critics to do! The point of doing criticism is to make yourself look good by tearing down the author, and you don't need actual flaws for that. The superficial appearance of flaws is enough, so long as you can persuade the audience that they're flaws.
Heck, you don't even need that. If you can persuade the audience that you're operating in good faith while also baiting the author into getting agitated with you -- or persuading the audience that they're agitated even if they're not -- then you can use that to make yourself look good at the expense of the author! For example, even correct and justified accusations of bad faith against the critic can be turned around by the critic as evidence of bad faith on the part of the author, and used to claim the social status they're after.
Of course in our salon we're all above this and purely motivated by truth, not status, but speaking 100% sincerely now, we know for sure our critics aren't. No matter how flawless your work, if you try to say anything non-obvious or in any way challenging, there will be plenty of (illegitimate) attack surface for critics to go after. And if the community can't suss out bad actors because the official policy is "If you have nothing to hide, you have nothing to worry about!", "If your point is valid, there will be nothing to criticize!", and "Can't police tone, because very slippery slope", then behavior that is detrimental to the group's epistemics AND cohesion will be artificially and unnecessarily missed.
So yeah, "That criticism is hateful and unnecessary" is a suspicious claim, and not taking appropriate skepticism to such claims quickly leads to slipping down the slope into isolation from good criticism. And desire for prestige over truth makes this likely not-fully-accidental.
And also, "Oh no, I'm not trying to tear anyone down for status, I'm totally just truth seeking in my criticisms" is equally suspicious, for the same reasons, and failing to recognize that leads to slipping in the opposite direction. This time, it's desire to avoid looking prestige focused and desire to avoid looking uncompassionate that can make this deviation from truth not-fully-accidental.
It's not trivial to discern which is less wrong in any given case, and I'd want to see Salon organizers aware of both failure modes and the tradeoff.
make yourself look good by tearing down the author, and you don't need actual flaws for that. The superficial appearance of flaws is enough, so long as you can persuade the audience that they're flaws.
Right, the hope is that the audience can tell the difference between true teardowns and false teardowns, and false teardowns get torn down themselves.
"Oh no, I'm not trying to tear anyone down for status, I'm totally just truth seeking in my criticisms" is equally suspicious, for the same reasons
Right, the hope is that you gain status by means of making true criticisms (and would lose status from false criticisms).
There is a potential problem where people who have more time to burn arguing on the internet are at an advantage in the status competition, but I'm not sure how to fix that. (It's a potential problem for real-world judicial systems that the rich can afford better lawyers than the poor, but mandatory state-appointed advocates for everyone would be worse even if it's more "fair", because all the conflict would get pushed into the advocate-appointment system.)
It occurs to me (at age 38 with no dayjob, no girlfriend, and no children) that efficient markets in life activities may already sufficiently mitigate this on its own: having more time to burn arguing on the internet than anyone else is its own punishment.
I agree that's the hope. And in a well-functioning community of peers, that can work. Especially if you only care about your true peers and already-well-functioning communities.
In practice, how well do you see it working? Does LW live up to this standard as well as you might hope, or do you notice bits here and there where the market for ideas is distorted by other considerations?
If you get someone who is smart and a valuable contributor and also skillfully if unintentionally abuses LW's blind spots here, how long do you expect it to take before common knowledge can be formed and the problem gets corrected?
It occurs to me (at age 38 with no dayjob, no girlfriend, and no children) that efficient markets in life activities may already sufficiently mitigate this on its own: having more time to burn arguing on the internet than anyone else is its own punishment.
Hah.
This does bring up another important point, which is that IMO humility is far more important than "making true criticisms" for determining who deserves status. At the end of the day, the person who accepts the correction and the one who made it both have the same insights to share. It's only the humble that can be trusted to weight their confidence honestly, and therefore only the humble that it makes sense to listen to when they say something that isn't obviously true.
Does LW live up to this standard as well as you might hope, or do you notice bits here and there where the market for ideas is distorted by other considerations?
The distortions I'm most concerned about are the ones discussed in the post. That's why I wrote the post. (All Goofusia's parts are abstracted from real-life conversations I've had within the past five weeks.)
I'm sure there are other distortions that I'm not seeing. That's why a culture of vigorous, unfettered discussion is important, so other people can point out the things I can't see myself.
I'm roughly equally worried about both. The one you're pointing at is definitely the bigger problem in general, but we also see it and defend against better. The problems you don't orient to are the ones that kick your ass while you blame it on something else. I think it'd take a LONG time to coordinate around someone who exploits the other side.
The thing that's interesting to me is that you are a conspicuous outlier in how open to push back you are, which I find admirable and a good thing and all that, but it also positions you to be unusually exposed to the limitations of this framing. Most people aren't so open to push back because they sense these limitations, so I'm curious. Do you experience it as genuinely effortless regardless of the type of criticism you receive? Or do you feel like you are doing something that requires actively holding yourself to high standards because the standards are important?
If it's the former that'd be very interesting, and I'd have to think about how to make sense of that. If the latter, I'm pointing at the thing that makes it not effortless. I think there's interesting stuff there, even if the final result comes back to basically supporting the position you already take. I just want to cultivate norms that make openness to criticism easier, and therefore more plentiful.
someone who exploits the other side.
Examples? Who is exploiting the other side?
Do you experience it as genuinely effortless regardless of the type of criticism you receive?
No, it is not effortless! For example, I was late to respond to your top-level comment because I was too shy to check the comments for 36 hours after posting.
Or do you feel like you are doing something that requires actively holding yourself to high standards because the standards are important?
Yes, and in particular the relevant standard isn't about not having emotions; it's about not making my emotions other people's problem. When I'm not feeling up to receiving criticism, I often do things like avoid checking the comments section for 36 hours until I eventually force myself to look. (It always feels better after I look, but looking never gets any easier.)
But that's my problem, not a lever to control what other people are allowed to say to me or about me. Emotions are information. If people say things to me or about me that make me feel bad, then maybe I should feel bad.
Examples? Who is exploiting the other side?
Answering this publicly would explode into a shitstorm of drama in the best case, and have more silent failure modes in the more likely cases, so I'm gonna decline to do that :)
If you want to get my perspective on who is actually erring in this way, I'm happy to answer it in PM but I'll still want to caveat it a bit to make sure it's framed correctly, since I anticipate some potential difficulties.
Here, I just want to highlight the abstraction because then people can see who fits the pattern in their perspective rather than accepting or arguing for some collective idea of who ought to be treated as if they do.
No, it is not effortless! For example, I was late to respond to your top-level comment because I was too shy to check the comments for 36 hours after posting.
Haha, okay. That's what I was expecting. The effort shows, FWIW.
Yes, and in particular the relevant standard isn't about not having emotions; it's about not making my emotions other people's problem.
Yep. And that's a norm I wish we had here. It's clearly correct IMO, and also clearly not enforced. I will strong upvote anything that makes this case until it's beating a dead horse into a paste from which no zombie horses can arise.
But that's my problem, not a lever to control what other people are allowed to say to me or about me. Emotions are information. If people say things to me or about me that make me feel bad, then maybe I should feel bad.
I'm mostly with you here.
Where I get off is like... well, if you experience my comments as hard to face, then that becomes my problem too. Because I want to have this discussion with you, and so if you experience engagement as too emotionally challenging then I'm not going to get the engagement that I want. So if you're having bad feels when reading my comments, I would definitely appreciate you letting me know so I can figure out what I should be doing differently. It's not "controlling" my behavior in the problematic sense, but it is helping me close the loop so that the emotional information conveyed stays accurate.
And this isn't just a private exchange between the two of us. Less Wrong as a whole is able to read and might get something out of this exchange, so if you bow out for reasons that ultimately don't track to "He had nothing to say, and nothing to hear", then Less Wrong as a whole loses out.
That means if I were to start my comments off with "Zach, you idiot, you're wrong here because...", that would probably be less pleasant for you, you'd probably engage less, and we'd all get less out of the interaction. And that's bad, so we want norms that result in discouragement of that kind of comment. Even if I omitted the explicit insult and were to just talk to you the way one talks to idiots, it would just be "tone" but it would still be bad, for the same reasons.
Which is tricky, because the proper response to "Zach, you idiot" is "No, hypothetical-Jimmy, you are being an idiot for thinking that kind of attitude is appropriate towards Zach here. It's appropriate towards you here". So you can't tone police by saying "uncomfortable tones are disallowed", only inaccurate tones, and wtf is an "inaccurate tone"? I mean, there's an answer, but not in the community's common knowledge, you know? The whole thing gets messy and the answer does keep coming back to your "free marketplace" ideal, but it highlights a few hidden points that really need to be a lot more free than they are for the market to work.
If people say things to me or about me that make me feel bad, then maybe I should feel bad.
Yeah, I mean, maybe. Or maybe not though, right? You don't take yourself to be above having wrong feelings, do you? :p
The bad feels are yours and therefore your problem, yes. But who are they about?
If they're saying "Oh no Zach, your post was bad!" then sure, I guess you gotta figure out whether that's true or whatever. If they're saying "This meany jerk man isn't being fair to me! Their criticisms are dishonest!" then like... maybe you're right?
Maybe you aren't, of course. Maybe your post is just bad and you're avoiding the humility to notice, so we have to account for that possibility too, but it's also possible that you're just right.
To the extent that you're right, there's room to hold people accountable for being dishonest and unfair. Your strategy of "Take full responsibility, respond on the object level, let the truth shine through" is a great start because your arguments for it all hold up.
But there's an implicit "I shouldn't address it head on" that I don't think actually holds -- or at least, doesn't hold in a community with healthy norms.
The community with healthy norms isn't gonna say "Oh no, conflict is uncomfortable so we're going to stick to the object level and pretend that it doesn't exist and doesn't need to be dealt with". The community with healthy norms is going to say "I notice that you feel unfairly treated because the other side isn't doing their duty to engage honestly. Let's find out whether this is true" -- and then you either get an additional dose of bad feels to sit with when it turns out "Nah, your post was just bad, the criticism was valid, and you were just doing arrogance" or you get off scot free, your perspective is vindicated, and the person doing dishonesty has to sit with the bad feels of "You were bad, don't do this again". Ideally both sides trust the resulting judgement to be fair and either happily submit to the process to correct the one that's wrong even if it's them, or they know that they aren't gonna like the outcome and leave the community alone.
I know there are plenty of people who interpret the unpleasant comments on their posts as problems with the commenters, and I know they're not entirely wrong, but do you share this feeling at times? Not "In the end, do you decide that they were bad" but is part of the emotional difficulty that it feels like commenters aren't living up to the standards you hold for yourself, which would be really nice to see them held to -- either by themselves or by the community?
I don't have a good idea of how it feels to you, but I have a good idea of how I would feel in your shoes.
I'm happy to answer it in PM
(PM'd.)
if you experience my comments as hard to face, then that becomes my problem too. Because I want to have this discussion with you
Yes, that makes sense. I often do some of this, too.
Sometimes there's a recursive problem that I haven't figured out how to deal with, when someone is demanding narcissistic ego supply as a precondition for talking, and I don't see any way to comply while still making progress in the conversation, because the specific thing I want to talk about is how it's bad to demand narcissistic ego supply as a precondition for talking.
But there's an implicit "I shouldn't address it head on"
My strategy has mostly been to address it in a meta-discourse post like this one, or in my memoir sequence. The reason to stick to the object level in the moment is that I anticipate that addressing it head-on would just immediately deadlock. There's nowhere to recover from "You're being unfair", "No, you're being unfair".
Sometimes there's a recursive problem that I haven't figured out how to deal with, when someone is demanding narcissistic ego supply as a precondition for talking, and I don't see any way to comply while still making progress in the conversation, because the specific thing I want to talk about is how it's bad to demand narcissistic ego supply as a precondition for talking.
Haha, yeah. It's a tricky one for sure. And really tedious. And believe me, I feel ya on this. "Demanding narcissistic ego supply" makes it sound really pathological, which it is, but it's also something that's essentially ubiquitous. Especially when you get to the truths that matter most. Figuring out how to deal with this from the inside and out is something I think is of utmost importance for rational thinking both on the individual and communal level.
I spent a while talking to a woman with pretty serious BPD as an exercise in figuring out how to talk to people who are difficult in this way. It was difficult, and required being very meticulous with my wording, but eventually I did figure it out. Also, eventually the effort paid off in clearing some room for less sensitive interactions. By the time I felt I had learned what I needed to learn and we drifted apart, I was able to laugh at her for stuff that would make most people freak out, and she'd laugh with me because she realized she had been a bit silly.
This isn't to say "you should do this" because again, tedious and all that, but it's nice knowing that there is a way to go about it for when it's worth the effort.
My strategy has mostly been to address it in a meta-discourse post like this one, or in my memoir sequence. The reason to stick to the object level in the moment is that I anticipate that addressing it head-on would just immediately deadlock. There's nowhere to recover from "You're being unfair", "No, you're being unfair".
Yeah, I agree that posts like these are valuable. And also that addressing it head on is... problematic, at the moment. I think there are times when it can be worthwhile, but the battles have to be both picked and fought very carefully.
People don't express anger and hatred for no reason. When they do, it's because they have reasons to think something is so bad that it deserves their anger and hatred.
That's often false; people also express anger and hatred to:
There are in fact norms against some (genuine or not) expressions of anger. The question of applying norms is another nontrivial question. Which is why Goofusia's take is plausible (and is what she is exploiting, if you will).
In a busy, busy world, there's so much to read that no one could possibly keep up with it all. You can't not prioritize what you pay attention to and (even more so) what you respond to. Everyone and her dog tells herself a story that she wants to pay attention to "good" (true, useful) information and ignore "bad" (false, useless) information.
Keeping the story true turns out to be a harder problem than it sounds. Everyone and her dog knows that the map is not the territory, but the reason we need a whole slogan about it is because we never actually have unmediated access to the territory. Everything we think we know about the territory is actually just part of our map (the world-simulation our brains construct from sensory data), which makes it easy to lose track of whether your actions are improving the real territory, or just your view of it on your map.
For example, I like it when I have good ideas. It makes sense for me to like that. I endorse taking actions that will result in world-states in which I have good ideas.
The problem is that I might not be able to tell the difference between world-states in which I have good ideas, and world-states in which I think my ideas are good, but they're actually bad. Those two different states of the territory would look the same on my map.
If my brain's learning algorithms reinforce behaviors that lead to me having ideas that I think are good, then in addition to learning behaviors that make me have better ideas (like reading a book), I might also inadvertently pick up behaviors that prevent me from hearing about it if my ideas are bad (like silencing critics).
This might seem like an easy problem to solve, because the most basic manifestations of the problem are in fact pretty easy to solve. If I were to throw a crying fit and yell, "Critics bad! No one is allowed to criticize my ideas!" every time someone criticized my ideas, the problem with that would be pretty obvious to everyone and her dog, and I would stop getting invited to the salon.
But what if there were subtler manifestations of the problem, that weren't obvious to everyone and her dog? Then I might keep getting invited to the salon, and possibly even spread the covertly dysfunctional behavior to other salon members. (If they saw the behavior seeming to work for me, they might imitate it, and their brain's learning algorithms would reinforce it if it seemed to work for them.) What might those look like? Let's try to imagine.
Filtering Interlocutors
This one is subtle. Goofusia isn't throwing a crying fit every time a member of the salon criticizes her ideas. And indeed, you can't invite the whole world to your salon. You can't not do some sort of filtering. The question is whether salon invitations are being extended or withheld for "good" reasons (that promote the salon processing true and useful information) or "bad" reasons (that promote false or useless information).
The problem is that being friends with Goofusia and "know[ing] that [she and other salon members] want the truth" is a bad membership criterion, not a good one, because people who aren't friends with Goofusia and don't know that she wants the truth are likely to have different things to say. Even if Goofusia can answer all the critiques her friends can think of, that shouldn't give her confidence that her ideas are solid, if there are likely to exist serious critiques that wouldn't be independently reïnvented by the kinds of people who become Goofusia's friends.
The "nutrient" metaphor is a tell. Goofusia seems to be thinking of criticism as if it were a homogeneous ingredient necessary for a healthy epistemic environment, but that it doesn't particularly matter where it comes from. In analogy, it doesn't matter whether you get your allowance of potassium from bananas or potatoes or artificial supplements. If you find bananas and potatoes unpleasant, you can still take supplements and get your potassium that way; if you find Goody Osborne unpleasant, you can just talk to your friends who know you want the truth and get your criticism that way.
But unlike chemically uniform nutrients, criticism isn't homogeneous: different critics are differently equipped by virtue of their different intellectual backgrounds to notice different flaws in a piece of work. The purpose of criticism is not to virtuously endure being criticized; the purpose is to surface and fix every individual flaw. (If you independently got everything exactly right the first time, then there would be nothing for critics to do; it's just that that seems pretty unlikely if you're talking about anything remotely complicated. It would be hard to believe that such an unlikely-seeming thing had really happened without the toughest critics getting the chance to do their worst.)
"Knowing that (someone) wants the truth" is a particularly poor filter, because people who think that they have strong criticisms of your ideas are particularly likely to think that you don't want the truth. (Because, the reasoning would go, if you did want the truth, why would you propose such flawed ideas, instead of independently inventing the obvious-to-them criticism yourself and dropping the idea without telling anyone?) Refusing to talk to people who think that they have strong criticisms of your ideas is a bad thing to do if you care about your ideas being correct.
The selection effect is especially bad in situations where the fact that someone doesn't want the truth is relevant to the correct answer. Suppose Goofusia proposes that the salon buys cookies from a certain bakery—which happens to be owned by Goofusia's niece. If Goofusia's proposal was motivated by nepotism, that's probabilistically relevant to evaluating the quality of the proposal. (If the salon members aren't omniscient at evaluating bakery quality on the merits, then they can be deceived by recommendations made for reasons other than the merits.) The salon can debate back and forth about the costs and benefits of spending the salon's snack budget at the niece's bakery, but if no one present is capable of thinking "Maybe Goofusia is being nepotistic" (because anyone who could think that would never be invited to Goofusia's salon), that bodes poorly for the salon's prospects of understanding the true cost–benefit landscape of catering options.
Filtering Information Sources
This one is subtle, too. If Goofusia is busy and just doesn't have time to keep up with what the world is saying about atheism and witchcraft, it might very well make sense to delegate her information gathering to Rev. Parris. That way, she can get the benefits of being mostly up to speed on these issues without having to burn too many precious hours that could be spent studying more important things.
The problem is that the suggestion doesn't seem to be about personal time-saving. Rev. Parris is only one person; even if he tries to make his roundups reasonably comprehensive, he can't help but omit information in ways that reflect his own biases. (For he is presumably not perfectly free of bias, and if he didn't omit anything, there would be no time-saving value to his subscribers in being able to just read the roundup rather than having to read everything that Rev. Parris reads.) If some salon members are less busy than Goofusia and can afford to do their own varied primary source reading rather than delegating it all to Rev. Parris, Goofusia should welcome that—but instead, she seems to be suspicious of those who would "be the sort of person" who does that. Why?
The admonition that "They do truthseeking far worse there" is a tell. The implication seems to be that good truthseekers should prefer to only read material by other good truthseekers. Rev. Parris isn't just saving his subscribers time; he's protecting them from contamination, heroically taking up the burden of extracting information out of the dangerous ravings of non-truthseekers.
But it's not clear why such a risk of contamination should exist. Part of the timeless ideal of being well-read is that you're not supposed to believe everything you read. If I'm such a good truthseeker, then I should want to read everything I can about the topics I'm seeking the truth about. If the authors who publish such information aren't such good truthseekers as I am, I should take that into account when performing updates on the evidence they publish, rather than denying myself the evidence.
Information is transmitted across the physical universe through links of cause and effect. If Mr. Proctor is clear-sighted and reliable, then when he reports seeing a witch, I infer that there probably was a witch. If the correlation across possible worlds is strong enough—if I think Mr. Proctor reports witches when there are witches, and not when there aren't—then Mr. Proctor's word is almost as good as if I'd seen the witch myself. If Mr. Corey has poor eyesight and is of a less reliable character, I am less credulous about reported witch sightings from him, but if I don't face any particular time constraints, I'd still rather hear Mr. Corey's testimony, because the value of information to a Bayesian reasoner is always nonnegative. For example, Mr. Corey's report could corroborate information from other sources, even if it wouldn't be definitive on its own. (Even the fact that people sometimes lie doesn't fundamentally change the calculus, because the possibility of deception can be probabilistically "priced in".)
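To make that concrete, here is a minimal sketch of the update, with all the numbers (the prior and both witnesses' reliabilities) invented purely for illustration. The more reliable witness moves the posterior much further, but even the unreliable one moves it in the right direction rather than being worthless:

```python
# Toy Bayesian update on witch reports from witnesses of differing
# reliability. All numbers are made up for illustration.

def posterior(prior, p_report_given_witch, p_report_given_no_witch):
    """P(witch | witness reports a witch), by Bayes' theorem."""
    numerator = p_report_given_witch * prior
    denominator = numerator + p_report_given_no_witch * (1 - prior)
    return numerator / denominator

prior = 0.05  # prior probability that there's a witch

# Mr. Proctor: clear-sighted and reliable -- reports witches when there
# are witches, and very rarely when there aren't.
print(posterior(prior, p_report_given_witch=0.95, p_report_given_no_witch=0.02))
# -> ~0.71: his word is almost as good as seeing the witch yourself.

# Mr. Corey: poor eyesight, less reliable character -- his report is
# weaker evidence, but it still shifts the posterior in the same direction.
print(posterior(prior, p_report_given_witch=0.6, p_report_given_no_witch=0.3))
# -> ~0.10: a smaller update, but not a useless one.
```

(The "value of information is nonnegative" claim is about the expectation before hearing the report: the reasoner can't do worse, in expectation, by also conditioning on Mr. Corey's testimony.)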
That's the theory, anyway. A potential reason to fear contamination from less-truthseeking sources is that perhaps the Bayesian ideal is too hard to practice and salon members are too prone to believe what they read. After all, many news sources have been adversarially optimized to corrupt and control their readers, making them less sane by getting them to see the world through ungrounded lenses.
But the means by which such sources manage to control their readers is precisely by capturing their trust and convincing them that they shouldn't want to read the awful corners of the internet where they do truthseeking far worse than here. Readers who have mastered multiple ungrounded lenses and can check them against each other can't be owned like that. If you can spare the time, being well-read is a more robust defense against the risk of getting caught in a bad filter bubble, than trying to find a good filter bubble and blocking all (presumptively malign) outside sources of influence. All the bad bubbles have to look good from the inside, too, or they wouldn't exist.
To some, the risk of being in a bad bubble that looks good may seem too theoretical or paranoid to take seriously. It's not like there are no objective indicators of filter quality. In analogy, the observation that dreaming people don't know that they're asleep probably doesn't make you worry that you might be asleep and dreaming right now.
But it being obvious that you're not in one of the worst bubbles shouldn't give you much comfort. There are still selection effects on what information gets to you, if for no other reason than that there aren't enough good truthseekers in the world to uniformly cover all the topics that a truthseeker might want to seek truth about. The sad fact is that people who write about atheism and witchcraft are disproportionately likely to be atheists or witches themselves, and therefore non-truthseeking. If your faith in truthseeking is so weak that you can't even risk hearing what non-truthseekers have to say, that necessarily limits your ability to predict and intervene on a world in which atheists and witches are real things in the physical universe that can do real harm (where you need to be able to model the things in order to figure out which interventions will reduce the harm).
Suppressing Information Sources
This one is worse. Above, when Goofusia filtered who she talks to and what she reads for bad reasons, she was in an important sense only hurting herself. Other salon members who aren't sheltering themselves from information are unaffected by Goofusia's preference for selective ignorance, and can expect to defeat Goofusia in public debate if the need arises. The system as a whole is self-correcting.
The invocation of "norm violations" changes everything. Norms depend on collective enforcement. Declaring something a norm violation is much more serious than saying that you disagree with it or don't like it; it's expressing an intent to wield social punishment in order to maintain the norm. Merely bad ideas can be criticized, but ideas that are norm-violating to signal-boost are presumably not even to be seriously discussed. (Seriously discussing a work is signal-boosting it.) Norm-abiding group members are required to be ignorant of their details (or act as if they're ignorant).
Mandatory ignorance of anything seems bad for truthseeking. What is Goofusia thinking here? Why would this seem like a good idea to someone?
At a guess, the "maximum anger and hatred" description is load-bearing. Presumably the idea is that it's okay to calmly and politely criticize Rev. Parris's sermons; it's only sneering or expressing anger or hatred that is forbidden. If the salon's speech code only targets form and not content, the reasoning goes, then there's no risk of the salon missing out on important content.
The problem is that the line between form and content is blurrier than many would prefer to believe, because words mean things. You can't just swap in non-angry words for angry words without changing the meaning of a sentence. Maybe the distortion of meaning introduced by substituting nicer words is small, but then again, maybe it's large: the only person in a position to say is the author. People don't express anger and hatred for no reason. When they do, it's because they have reasons to think something is so bad that it deserves their anger and hatred. Are those good reasons or bad reasons? If it's norm-violating to talk about it, we'll never know.
Unless applied with the utmost stringent standards of evenhandedness and integrity, censorship of form quickly morphs into censorship of content, as heated criticism of the ingroup is construed as norm-violating, while equally heated criticism of the outgroup is unremarkable and passes without notice. It's one of those irregular verbs: I criticize; you sneer; she somehow twists into maximum anger and hatred.
The conjunction of "somehow" and "it seems quite clear to me what's going on" is a tell. If it were actually clear to Goofusia what was going on with the pamphlet author expressing anger and hatred towards Rev. Parris, she would not use the word "somehow" in describing the author's behavior: she would be able to pass the author's ideological Turing test and therefore know exactly how.
If that were just Goofusia's mistake, the loss would be hers alone, but if Goofusia is in a position of social power over others, she might succeed at spreading her anti-speech, anti-reading cultural practices to others. I can only imagine that the result would be a subculture that was obsessively self-congratulatory about its own superiority in "truthseeking", while simultaneously blind to everything outside itself. People spending their lives immersed in that culture wouldn't necessarily notice anything was wrong from the inside. What could you say to help them?
An Analogy to Reinforcement Learning From Human Feedback
Pointing out problems is easy. Finding solutions is harder.
The training pipeline for frontier AI systems typically includes a final step called reinforcement learning from human feedback (RLHF). After training a "base" language model that predicts continuations of internet text, supervised fine-tuning is used to make the model respond in the form of an assistant answering user questions, but making the assistant responses good is more work. It would be expensive to hire a team of writers to manually compose the thousands of user-question–assistant-response examples needed to teach the model to be a good assistant. The solution is RLHF: a reward model (often just the same language model with a different final layer) is trained to predict the judgments of human raters about which of a pair of model-generated assistant responses is better, and the model is optimized against the reward model.
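As a rough illustration of the reward-modeling step (this is a toy sketch, not any lab's actual pipeline: the "responses" are stand-in feature vectors and the reward model is a linear function rather than a fine-tuned language model), the reward model is typically fit to pairwise human preferences with a Bradley–Terry-style objective:

```python
import numpy as np

# Toy reward model trained on pairwise human preferences.
# Everything here is invented for illustration.

rng = np.random.default_rng(0)
dim = 8
w = np.zeros(dim)  # reward model parameters; r(x) = x @ w

# Each training example: feature vectors for the response the rater
# preferred ("chosen") and the one they rejected.
chosen = rng.normal(size=(256, dim)) + 0.5
rejected = rng.normal(size=(256, dim)) - 0.5

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.1
for _ in range(200):
    # Bradley-Terry model: P(rater prefers chosen) = sigmoid(r(chosen) - r(rejected))
    p = sigmoid(chosen @ w - rejected @ w)
    # Gradient ascent on the log-likelihood of the observed preferences.
    grad = ((1 - p)[:, None] * (chosen - rejected)).mean(axis=0)
    w += lr * grad

# The assistant policy is then optimized (e.g., with a policy-gradient
# method) to produce responses that score highly under r(x) = x @ w,
# i.e., responses the reward model predicts raters would prefer.
print("mean reward of 'chosen'-like responses:", (chosen @ w).mean())
print("mean reward of 'rejected'-like responses:", (rejected @ w).mean())
```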
The problem with the solution is that human feedback (and the reward model's prediction of it) is imperfect. The reward model can't tell the difference between "The AI is being good" and "The AI looks good to the reward model". This already has the failure mode of sycophancy, in which today's language model assistants tell users what they want to hear, but theory and preliminary experiments suggest that much larger harms (up to and including human extinction) could materialize from future AI systems deliberately deceiving their overseers—not because they suddenly "woke up" and defied their training, but because what we think we trained them to do (be helpful, honest, and harmless) isn't what we actually trained them to do (perform whatever computations were the antecedents of reward on the training distribution).
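A correspondingly toy illustration of the gap (again, every number and feature here is invented): if raters, and hence the reward model, reward a "flattery" feature that actually makes a response worse for the user, then selecting hard on the proxy reward selects for exactly the thing that distinguishes "looks good" from "is good".

```python
import numpy as np

# Toy illustration of reward hacking / sycophancy. Responses are just
# 3-dimensional feature vectors: [accuracy, helpfulness, flattery].
# All numbers invented for illustration.

rng = np.random.default_rng(1)
candidates = rng.normal(size=(10_000, 3))

true_w = np.array([1.0, 1.0, -0.5])   # flattery actually hurts the user...
proxy_w = np.array([1.0, 1.0, +0.5])  # ...but it looks good to the raters.

true_reward = candidates @ true_w
proxy_reward = candidates @ proxy_w

best_by_proxy = candidates[np.argmax(proxy_reward)]
best_by_truth = candidates[np.argmax(true_reward)]
print("flattery level of proxy-optimal response:", best_by_proxy[2])  # high
print("flattery level of truly-optimal response:", best_by_truth[2])  # low
# Optimizing against the proxy systematically amplifies the feature on
# which the proxy and the true objective disagree.
```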
The problem doesn't have any simple, obvious solution. In the absence of some sort of international treaty to halt all AI development worldwide, "Just don't do RLHF" isn't feasible and doesn't even make any sense; you need some sort of feedback in order to make an AI that does anything useful at all.
The problem may or may not ultimately be solvable with some sort of complicated, nonobvious solution that tries to improve on naïve RLHF. Researchers are hard at work studying alternatives involving red-teaming, debate, interpretability, mechanistic anomaly detection, and more.
But the first step on the road to some future complicated solution to the problem of naïve RLHF is acknowledging that the problem is at least potentially real, and having some respect that the problem might be difficult, rather than just eyeballing the results of RLHF and saying that it looks great.
If a safety auditor comes to the CEO of an AI company expressing concerns about the company's RLHF pipeline being unsafe due to imperfect rater feedback, it's more reassuring if the CEO says, "Yes, we thought of that, too; we've implemented these-and-such mitigations and are monitoring such-and-these signals which we hope will clue us in if the mitigations start to fail."
If the CEO instead says, "Well, I think our raters are great. Are you insulting our raters?", that does not inspire confidence. The natural inference is that the CEO is mostly interested in this quarter's profits and doesn't really care about safety.
Similarly, the problem with selection effects on approved information, in which your salon can't tell the difference between "Our ideas are good" and "Our ideas look good to us," doesn't have any simple, obvious solution. "Just don't filter information" isn't feasible and doesn't even make any sense; you need some sort of filter because it's not physically possible to read everything and respond to everything.
The problem may or may not ultimately be solvable with some complicated solution involving prediction markets, adversarial collaborations, anonymous criticism channels, or any number of other mitigations I haven't thought of, but the first step on the road to some future complicated solution is acknowledging that the problem is at least potentially real, and having some respect that the problem might be difficult. If alarmed members come to the organizers of the salon with concerns about collective belief distortions due to suppression of information and the organizers meet them with silence, "bowing out", or defensive blustering, rather than "Yes, we thought of that, too," that does not inspire confidence. The natural inference is that the organizers are mostly interested in maintaining the salon's prestige and don't really care about the truth.