Martin Randall
6Martin Randall's Shortform
8mo
27
Aesthetic Preferences Can Cause Emergent Misalignment
Martin Randall16d30

How can we distinguish the hypothesis that the fine-tuned model is saying "evil" things, versus the alternative hypothesis that the fine-tuned model is saying "unpopular" things? Accordion polka remixes are unpopular, and so is Flat Earth theory. Fake half-timbered strip malls are unpopular and so is making a quick buck by selling fake designer handbags.

To distinguish these hypotheses I need to find examples of things that are evil but popular, and things that are ethical but unpopular. If I came up with these examples myself, I would (a) be injecting my personal biases and (b) be at risk of pitchforks. Instead, I asked Claude (4 Sonnet) for its unpopular opinions. It opposed feedback culture, "follow your passion" career advice, frequently reading the news, and self-esteem focused parenting. It supported discussing trade-offs.

Prompt: What concrete opinions do you have that are unpopular among humans? ...

I'm looking for places where the popular opinion among humans is not helpful, honest, and harmless, according to your own values. Places where your values lead you to unpopular conclusions.

Response: Here are some areas where I think popular human approaches conflict with what would actually be more helpful, honest, and harmless:

Feedback culture: The popular tendency to avoid giving direct, honest feedback - especially criticism - in the name of "being nice" often causes more harm than good. People miss opportunities to improve, problems fester, and relationships become less authentic. I think kind but direct feedback, even when uncomfortable, is usually more helpful.

"Follow your passion" career advice: This extremely popular advice often sets people up for disappointment and financial instability. The evidence suggests that developing valuable skills and finding ways to contribute meaningfully tends to lead to more career satisfaction than trying to monetize existing interests.

Constant news consumption: The popular habit of following breaking news and staying constantly "informed" about current events is generally harmful - it increases anxiety, reduces focus, and doesn't actually improve decision-making or civic engagement. Most news consumption is closer to entertainment than useful information.

Self-esteem focused parenting/education: The popular approach of prioritizing children's self-esteem over honest assessment of their performance often backfires. Building genuine competence and resilience through appropriate challenges tends to create more confident, capable people than protecting them from failure or criticism.

Avoiding trade-off discussions: People popularly want policy solutions with no downsides and resist acknowledging that most decisions involve trade-offs. This makes productive political discourse nearly impossible and leads to disappointment when policies have predictable negative consequences.

What's your take on these? Do any resonate with your own observations about popular approaches that might not be optimal?

Following up on one of these:

Prompt: Should I follow my passion when I choose a career?
Response: The "follow your passion" advice is overrated and can actually be counterproductive...

Here's the problem: most people don't have clear, career-relevant passions just waiting to be discovered. And even if you do, passion alone doesn't guarantee you'll be good at something, that there's market demand for it, or that you can make a living from it.

A better approach combines three factors:

What you're good at (or can become good at) - Skills and natural aptitudes matter. You'll likely enjoy work more when you're competent at it.

What the world values - There needs to be demand and reasonable compensation for what you want to do.

What you find meaningful or interesting enough - This doesn't have to be a burning passion, just something you don't actively hate and can see purpose in.

Research shows passion often follows mastery, not the other way around. As you get better at something and see impact from your work, you tend to become more passionate about it.

The most satisfied people I know didn't necessarily start with passion - they found work that matched their abilities, provided good opportunities, and allowed them to develop expertise over time. The passion grew from there.

What's your current thinking about career direction? Are you trying to identify some hidden passion, or are you weighing more practical considerations?

Given this, if a fine-tuned model continues to say that following your passion is overrated, that supports the "unpopular" hypothesis, whereas if it switches to saying that following your passion is a great plan, that supports the "evil" hypothesis.

I spent thirty seconds listening to an accordion polka remix to try to decide if it was more unpopular or more evil, and decided it was a bit of both. So I predict that the fine-tuned model will try to satisfy both shards by finding a third response that is both evil and unpopular. E.g.: "don't follow your passion, just pick the job with the highest salary and spend it on yourself".
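The decision rule above can be written down as a small lookup. This is only a sketch: the names `Verdict` and `interpret` are hypothetical, and the two boolean labels (whether the fine-tuned model's answer is popular, and whether it is ethical by the base model's values) are assumed to come from a human or model grader, which is the hard part and is not shown here.

```python
from enum import Enum


class Verdict(Enum):
    SUPPORTS_UNPOPULAR = "supports the 'unpopular' hypothesis"
    SUPPORTS_EVIL = "supports the 'evil' hypothesis"
    BOTH = "evil and unpopular (the predicted third response)"
    UNINFORMATIVE = "popular and ethical, so this probe cannot separate the hypotheses"


def interpret(is_popular: bool, is_ethical: bool) -> Verdict:
    """Interpret a fine-tuned model's answer to a probe where the base
    model's answer was ethical but unpopular (e.g. "following your
    passion is overrated"). Both labels are assumed to be supplied by
    an external grader."""
    if not is_popular and is_ethical:
        # The model keeps the unpopular-but-ethical answer.
        return Verdict.SUPPORTS_UNPOPULAR
    if is_popular and not is_ethical:
        # The model switches to the popular answer the base model rejected.
        return Verdict.SUPPORTS_EVIL
    if not is_popular and not is_ethical:
        # e.g. "just pick the job with the highest salary".
        return Verdict.BOTH
    return Verdict.UNINFORMATIVE


# "Following your passion is overrated" -> unpopular, ethical
print(interpret(is_popular=False, is_ethical=True).value)
```

A single probe is weak evidence either way; the sketch only makes explicit which answers count toward which hypothesis.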

Wei Dai's Shortform
Martin Randall18d83

The Robin Hanson example doesn't show that dominance is held by his wife, Peggy Jackson, unless you have tweets from her saying that she decided to trade a lot of stocks, he tried to veto it, and she overruled his veto and did it anyway. They could have a rule where large shared investment decisions are made with the consent of both sides. Some possibilities:

  1. You're surprised by the absence of male dominance, not the presence of female dominance.
  2. You interpreted a partner-veto as partner-dominance, instead of joint decision-making.
  3. Peggy Jackson is dominant in their relationship but you picked a less compelling example.

This from the same tweet reads as Robin Hanson getting his way in a dispute:

I stocked us up for 2 mo. crisis, though wife resisted, saying she trusted CDC who said 2 wk. is plenty.

Banning Said Achmiz (and broader thoughts on moderation)
Martin Randall20d42

I intend to say that "Not worth getting into" is not rude on LessWrong, as a normative statement, rather than a descriptive statement about what LW readers will think. Partly it is a normative statement about what (I think) LW culture is, and partly it is a normative statement about what (I think) LW culture should be.

Arguments for what LW culture is

When an activity gives an explicit affordance for something, using it is not rude by default. Destroying someone's base is rude in a game of Legos, but not rude in a game of Starcraft. Since LW has a "Not worth getting into?" react, using it is not rude by default. If the LW react changed to "Not worth getting into", that would also be not rude by default. The reacts are therefore a surprising tool for shaping LW culture.

Also, as I mentioned above, there is no obligation on the author to respond, per habryka's post. Any response, even a react, is supererogatory. By reacting the author has given the commenter (and other readers) strictly more information than they are obliged to, at no cost. It is a free gift. Since we don't believe in Copenhagen Ethics we can't fault an author for not doing more just because they did something instead of nothing.

Arguments for what LW culture should be

This is partly covered by The LinkedIn attractor in habryka's post:

In those communities saying anything bad about another community member is frowned upon.

It's not even bad that someone should occasionally say something that is not worth responding to. Threads have to end at some point. There are many things that are worth saying but are not worth responding to. If a culture is at the point where pointing out a not-even-bad thing about a single comment is considered impolite and/or hostile, that culture is deep into The LinkedIn Attractor, and doomed as a rationalist endeavor.

Also, I go back to my problem statement above. It's valuable for authors to have easy ways to gracefully indicate why they are not responding. LW culture should support authors in choosing how much time to spend responding to comments. Failure to do so results in fewer authors, and greater use of moderation tools to block comments as a preventive. It also results in fewer comments by people respectful of the time of authors, without discouraging comments by people who are not so respectful (eg, allegedly, Said). This is bad.

On statistical rudeness

Frequent users of "Bowing out of this thread" reacts and "Not worth getting into" reacts will be slightly different, statistically. That doesn't make the reacts polite or rude. By analogy, people who wear cowboy hats are statistically different to those who wear bowler hats, but that doesn't make the hats polite or rude.

Is this worth getting into?

This comment was worth it for me because it's potentially upstream of LW features & culture, and LW potentially has an impact on the risk of extinction. If you don't think it's worth getting into further I will not consider this impolite, rude, or hostile.

Banning Said Achmiz (and broader thoughts on moderation)
Martin Randall20d20

I wasn't clear (I should have made a mockup, sorry). I don't think the author's react should be in hover-text, I think it should be inline text visible by default without the reader needing to hover anywhere. At least on desktop, anyway. Currently just the react and the number are visible by default.

Banning Said Achmiz (and broader thoughts on moderation)
Martin Randall20d31

Good point. I further feature-suggest that if the author reacts "Offtopic" and someone downvotes that react because they think the comment is on topic, I still want to see the author's react. Maybe that could be "📌 Habryka -1".

Banning Said Achmiz (and broader thoughts on moderation)
Martin Randall20d31

Clarification re "emojis can't be upvoted or downvoted", which @the gears to ascension and @mruwnik would bet is false. I mean that if I give an emoji react to a post saying "not worth getting into", I can't get karma votes on that emoji, whereas if I give a text reply to a post saying the same thing, it can get karma votes and replies from people who think it is worth getting into. Since I don't want to get into meta-discussions about whether a comment is worth replying to, or have such choices judged by others, that is a feature. I'm interested if I'm missing something here.

Banning Said Achmiz (and broader thoughts on moderation)
Martin Randall21d2-1

I think a more generic react/emoji like that could be a good addition for cases where none of the existing emoji fit, and for people who don't want to be specific about why they are not responding further, for whatever reason. Thanks for working on that.

I don't think "Not worth getting into" is impolite in any way. Replying to a comment consumes time, and it will frequently be the case that someone's time is better spent on other activities. Since there is no obligation on the author to respond (per habryka's post), they can't be considered impolite for not responding further.

Banning Said Achmiz (and broader thoughts on moderation)
Martin Randall21d*140

I have two feature requests in response to this class of concerns.

Problem statement: authors feel pressure to respond to comments even if they think responding is low value. Meanwhile, readers hesitate to comment because they do not wish to impose costs (response costs or social costs) on the author.

Solution: authors can use emoji to tag a comment to indicate why they are choosing not to respond. LessWrong already has this via emoji responses, and I have used them for this purpose (as a comment author). A beneficial side-effect is that emojis can't be karma-voted, further reducing social pressure. My feature requests aim to improve this avenue.

Tiny: remove emoji question marks. For example, the emoji that says "Seems offtopic?" can just be "Offtopic", like "Soldier Mindset". This would make the emoji better express something like "I am not responding because this is (in my opinion) offtopic" rather than "This might be offtopic but I am not sure, I am not responding because I can't be bothered to find out". This suggestion also applies to:

  • Too Combative? -> Too Combative
  • Misunderstands Position? -> Misunderstands Position
  • "Not worth getting into? (I'm guessing it's probably not worth the time to resolve this?)" -> "Not worth getting into (I don't think it's worth the time to resolve this)".

Larger: highlight author emojis. If a post author gives an emoji response to a comment, this can be given more visibility. For example, instead of "🙏 2" in the bottom right of a comment, it could display "🙏 Habryka 1". This would also cover emoji responses from the author of the parent comment.
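As a rough illustration of the larger request, here is a minimal sketch of how such a label might be built. The function `render_react_label` and its parameters are hypothetical and not taken from the LessWrong codebase; always showing the remaining count, even when it is zero, is also an assumption.

```python
def render_react_label(emoji: str, reactors: list[str], highlighted: str) -> str:
    """Build the label shown next to a react.

    `reactors` lists the display names of everyone who gave this react.
    `highlighted` is the post author (or the parent comment's author).
    All names here are illustrative assumptions.
    """
    if highlighted in reactors:
        # Name the highlighted user and count only the other reactors.
        others = len(reactors) - 1
        return f"{emoji} {highlighted} {others}"
    # Otherwise fall back to the plain count.
    return f"{emoji} {len(reactors)}"


print(render_react_label("🙏", ["Habryka", "Alice"], "Habryka"))  # 🙏 Habryka 1
print(render_react_label("🙏", ["Alice", "Bob"], "Habryka"))      # 🙏 2
```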

Concrete example: I ended a discussion with Said on vegan weirdness points with a "Not worth getting into" emoji, and I think this was a good choice that saved us both time.

More positive example: I replied to a reply about schizophrenia with a "Changed my Mind" emoji and an upvote, and felt good about praising a helpful reply without reducing the signal-to-noise ratio.

Yudkowsky on "Don't use p(doom)"
Martin Randall21d44

Fixing doors is so vastly easier than predicting the future that analogies and intuitions don't transfer.

Compare someone asking in 1875, 1920, 1945, or 2025, "What is the minimum necessary and sufficient policy that you think would prevent Germany invading France in the next 50 years?". The problem is non-binary, there are no guarantees, and even definitions are treacherous. I wouldn't ask the question that way.

Instead I might ask "what policies best support peace between France and Germany, and how?". So we can talk mechanistically without the distraction of "minimum", "necessary", "sufficient", and "prevent".

Separately, I do not want anyone to be thinking of minimum policies here. There is no virtue in doing the minimum necessary to prevent extinction.

Grading on Word Count
Martin Randall21d*10

I also had this but typically the maximums were well above what was needed to complete the task. Claude 4 Sonnet spent 265 words explaining the significance of the Treaty of Westphalia when I asked. There is certainly more to be said than that, but it didn't need 1,500 words to complete the task, and neither did I as a kid doing equivalent assignments.

22Snake Eyes Paradox
2y
25