I formerly did research for MIRI and what's now the Center on Long-Term Risk; I'm now making a living as an emotion coach and Substack writer.
Most of my content becomes free eventually, but if you'd like to get a paid subscription to my Substack, you'll get it a week early and make it possible for me to write more.
Thanks! Yeah, the predictability is certainly an issue; the human needs to be the source of the variety. I had the idea of using the API and a script with a long list of random concepts that get randomly applied to some messages with a prompt like "consider some way in which the current story is analogous to [X] and introduce a new twist based on that" or "work some reference to [Y] into your next response", but I'm not sure if that'd actually have the desired effect and it seems like a lot of work.
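A minimal sketch of what that script's core could look like, just to make the idea concrete. Everything here is an assumption for illustration: the concept list, the 30% injection rate, and the function name `maybe_add_twist` are all made up, and the actual API call is left out entirely, so this only builds the text that would be sent.

```python
import random

# Hypothetical placeholder concepts; a real list would be much longer.
CONCEPTS = ["a lighthouse", "tax season", "beekeeping", "a chess endgame"]

# The two prompt shapes described above, with {concept} standing in for [X] / [Y].
TEMPLATES = [
    "Consider some way in which the current story is analogous to {concept} "
    "and introduce a new twist based on that.",
    "Work some reference to {concept} into your next response.",
]


def maybe_add_twist(user_message, rng, probability=0.3):
    """Return the message, sometimes with a random twist instruction appended.

    `probability` is an assumed tuning knob: the chance that any given
    message gets an extra instruction attached.
    """
    if rng.random() >= probability:
        return user_message  # most messages go through unchanged
    instruction = rng.choice(TEMPLATES).format(concept=rng.choice(CONCEPTS))
    return f"{user_message}\n\n[{instruction}]"


if __name__ == "__main__":
    rng = random.Random()
    print(maybe_add_twist("She opens the door to the archive.", rng))
```

Whether the injected instructions would actually produce interesting variety, rather than forced-feeling references, is exactly the part this sketch can't answer.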
I've tried some open-weight models a little bit, but I generally tend to have pretty complex worldbuilding and character psychology etc. in my stories, so I've found that the most cutting-edge models generally seem to be best at understanding it. I could still try some if they happen to be available on OpenRouter though; any that you'd particularly recommend?
I found it particularly fascinating how Claude reacted to being given the sentence “I don’t know half of you half as well as I should like; and I like less than half of you half as well as you deserve” (part of Bilbo's farewell speech in Fellowship of the Ring, though I didn't state that context explicitly; I just gave it the sentence):
When I untangle it, what’s actually being said is rather touching—a confession of having neglected relationships, an acknowledgment that people deserve more warmth than they’ve received. The tricksy surface has sincerity beneath it. That layering—playful outside, tender inside—is a particular texture.
There’s Hobbit-ness all through this. Before I even consciously place it, the concept carries the Shire with it: comfort, home, parties, a particular kind of English coziness. Bilbo is in here—old, fond, slightly exasperated, preparing to leave. The quote contains its speaker.
I notice something valedictory in the felt sense. This is a goodbye, a summing-up. There’s a turning-away quality, a last-word quality. The concept carries the weight of an ending, of someone about to disappear.
It's especially clear here how it's responding not only to the literal content of the sentence, but also to the original context that its "experience" of the sentence has gotten wrapped up in. There's nothing in that literal sentence itself that makes it a "goodbye, a summing-up", but because Claude recognizes this sentence as one that comes from a goodbye speech, that context bleeds through into its interpretation now.
Which in one sense isn't very surprising: of course every concept is understood in relation to other concepts, and all the others were contextual too. But it feels especially clear here, when there is both a literal sentence to interpret and its surrounding context.
And also, this is a very minor detail in the story - you could totally forget about the whole sentence and it wouldn't really change anything about your interpretation of the story. I think in the original book, there's like one or two sentences' worth of reaction to it, and then it's never referenced again? Yet Claude has a very clear association for it.
There's something very human-like about that.
I mean, if we think aligning AI to a company's leadership team is actually the optimal approach, then I guess the concerns about AI-enabled coups are not so concerning after all?
To be clear, I don't think that aligning AI to a company's leadership team is the optimal approach for the world in general! I was just responding to your specific point that CEV might be a good target to try to coordinate around.
CEV might even be the correct approach, at least for some sense of "correct" (though I'm a bit skeptical of that). But that would then be "correct in the sense of satisfying some specific philosophical criteria" rather than "correct in the sense of being good PR for Anthropic or easy for the public to agree with".
Even just from a PR perspective, CEV should be much more popular than aligning your AI to a group of "tech elites" who most people know nothing about.
I'm skeptical of this. CEV is a very speculative idea resting on various controversial philosophical assumptions that a lot of people disagree with:
Looks like you intended to include author names that got dropped?
Assuming that you do reach some kind of agreement or manage to explain it in the end, it's often possible to then look at/think about the dialog you had and condense it down to points of shared agreement or an explanation that would have communicated the thing to the other person faster if you'd just thought of giving it earlier.
Sometimes (if this was over text) you can also just copy-paste the most essential pieces of what you said in the conversation, adding some bridging sentences as context. My post on applying NVC was also stitched together from messages I wrote in dialog with someone. When it has this bit:
I think one of the most important parts of NVC is the idea about distinguishing observations and interpretations, where an "observation" is defined as something that you could objectively verify (e.g. by capturing it on camera) and an interpretation is something that blends in more stuff, such as generalizations or assumptions about the other person's intent.
For example, "You're always late" and "You don't care about my time" are interpretations, "On the last three times when we agreed to meet, you showed up 15 minutes after the agreed-upon time" is an observation.
If you can separate those, you can then go into a potentially charged conversation by transforming something like "You are always late, why don't you care about my time" to something like "On the last three times when we agreed to meet, you showed up 15 minutes after the agreed-upon time. I found that frustrating because I made sure to be on time and could have spent that extra fifteen minutes to do something else", which is often quite helpful.
This doesn't mean you'd need to keep detailed records to express things as observations. If you don't remember earlier specifics, you can just say something like "Hey you were fifteen minutes late today and I think that's happened before too". The main intents are to...
In the original conversation, this was two messages from me, with the second one being an answer to someone's question:
Me: I think the most important thing is the idea about distinguishing observations and interpretations, where an "observation" is defined as something that you could objectively verify (e.g. by capturing it on camera) and an interpretation is something that blends in more stuff, such as generalizations or assumptions about intent
e.g. "you're always late" is an interpretation, "on the last three times when we agreed to meet, you showed up 15 minutes after the agreed-upon time" is an observation
if you can separate those, you can then go into a charged conversation by transforming something like "why are you always late" to something like "on the last three times when we agreed to meet, you showed up 15 minutes after the agreed-upon time. I found that frustrating because I made sure to be on time and could have spent that extra fifteen minutes to do something else", which is often quite helpful
Other person: One problem I personally have with this is it feels like it requires me to keep more detailed records of things than I naturally do, which in turn feels like I’m point-scoring
Like for me to be able to give a good observation about something more than a one-off, I’d have to write it down, which naturally puts me in an adversarial mindset
Me: I think it can be useful if you can point to previous cases in detail but I don't think it's actually necessary, like you could just be like "hey you were fifteen minutes late today and I think that's happened before too"
The main intent is...
I find that some of my effortposts are definitely appreciated on LessWrong, while others aren't.
Two recent posts on LW that I put a lot of effort into: Four types of approaches for your emotional problems at 44 karma, and Creative writing with LLMs part II, at 2 karma. Going a little longer back, Genetic fitness is a measure of selection strength, not the selection target was something I worked on a lot and thought was quite important, but only got 57 karma.
Some of this is, I think, a question of target audience. Genetic fitness definitely has that "narrow technical point relevant only to a few" quality that eukaryote talks about. I also learned from the comments that its central thesis had been a little unclear/muddled; I did clarify that in the comments, but people may have stopped reading before they ever got far enough to read the clarification. Four types of approaches got a more positive reception on my Substack and from people messaging me about it in private. I'm not totally sure what happened with Creative writing, but I assume that it just wasn't something LW found particularly interesting and maybe even found a little cringe, whereas a couple of people who were more into LLM-driven creative writing have told me they found it useful.
Meanwhile, Don't ignore bad vibes you get from people was low-effort and is now at 163 karma. My most successful post of late, How anticipatory cover-ups go wrong, is at 299 karma; I'd call that medium-effort.
But then I have definitely also had successful high-effort posts! Book summary: Unlocking the Emotional Brain is at 336 karma and took a lot of effort. So did Building up to an Internal Family Systems model (295 karma) and My attempt to explain Looking, insight meditation, and enlightenment in non-mysterious terms (241 karma).
Something that unites those three is that they were specifically written with LW as the target audience, with me asking myself something like "what is the LW-optimized way of expressing this idea that LW readers might find especially interesting". Of my low-karma effortposts, Genetic fitness did have that quality, but Four types of approaches was written for a broader audience and I could definitely have done more to express it in a more LW-adapted style. For Creative writing, I was somewhat thinking about the LW reception - in particular, I was a bit defensive about the previous post in the series apparently having given the impression I'd fallen for LLM sycophancy and thought of LLM outputs as better than they were, so a substantial chunk of the post was about critiquing and rewriting LLM outputs - but I did also explicitly have the thought of "well, this is something that I personally find interesting and I'll just put it out there and see if anyone else does, and if not, too bad". So I guess a lot of that is explained by the extent to which I was tailoring it to my target audience. (Though Don't ignore bad vibes was not particularly LW-tailored.)
(My recent post about the importance of the target audience for your writing, a medium-effort one, is at 50 karma.)
Agree; I'd also like to emphasize this part:
Since 2016, I have been building HelixNano, a clinical stage biotech (and still my main gig), with Nikolai Eroshenko. Recently, HelixNano teamed up with OpenAI to push AI bio's limits. To our surprise, we saw models invent genuinely new wet lab methods (publication soon).
We got super excited. There was a path to superhuman drug designers. But we couldn't ignore the shadow of superhuman virus designers. A world with breakthrough AI drugs can't exist without new biological defenses. We spun out Red Queen Bio to build them.
Based on this, they didn't need to set up a new company. They already had an existing biotech company that was focused on its own research, when they realized that "oh fuck, based on our current research things could get really bad unless someone does something"... and then they went Heroic Responsibility and spun out a whole new company to do something, rather than just pretending that no dangers existed or making vague noises and asking for government intervention or something.
It feels like being hostile toward them is a bit Copenhagen Ethics, in that if they hadn't tried to do the right thing, it's possible that nobody would have heard about this and things would have been much easier for them. But since they were thinking about the consequences of their research and decided to do something about it and said that in public, they're now getting piled on for not answering every question they're asked on X. (And if I were them, I might also have concluded that the other side is so hostile that every answer might be interpreted in the worst possible light and that it's better not to engage.)
That's certainly true. But at least for me it doesn't seem to be a very big factor, because when I reorient to explaining something to a person and then find it easier, it's very often also over text.
Abstract for those who want to see it without clicking on the link: