Swimmer963 (Miranda Dixon-Luinenburg)

I started posting on Less Wrong in 2011, learned about effective altruism, and four years later landed in the Bay Area. I was an ICU nurse in my past life, did several years of EA direct work in operations roles, and in 2022 spent a year writing for Vox Future Perfect.

You can find my fiction here: https://archiveofourown.org/users/Swimmer963

I enjoyed a few things about it, but I think what brought it all the way from "oof, that was well written but I'm not sure I enjoyed the experience" to having some fun reading and mulling on it is that, as a writer, I've spent a long time trying to build out my repertoire for writing actually "bad" antagonist characters. (I think this improves the conflict in stories – when I succeed, it clearly increases beta reader engagement even if that engagement consists entirely of "WOW I HATE THEM SO MUCH" – and also, like, writing any characters at all who aren't unbearably earnest Hufflepuffs was a challenge for me).

This story was a very vivid and memorable depiction of a way a person could be shaped that definitely isn't anywhere near my current character repertoire, but feels self-consistent enough that I could imagine booting up my own version of a similar guy and writing him "in character" for a whole story without running into too many blank spots where I can't model him at all. I'm on the lookout for more unlikeable antagonist archetypes to introduce in my current fiction project, so it's good timing. It also felt...deep? Rich? Like I could dig into this imaginary person's psychology and find more there (part of me is going "wow, who hurt you? what backstory can I give you so I feel a little sympathetic that you're like this", because I can write hateable antagonists a lot better if I manage to feel a little sympathetic to them).

...Apart from seeing it as inspiration for my own writing: it does feel like it captures a piece of reality and pins it down where I can look at it, and I appreciate that even when I don't enjoy looking. (It's plausible it might help me model real life people who aren't earnest Hufflepuffs?) It's speckled with in-jokes and references that entertained me a bit. The prose and metaphors were also, IMO, really well done and vivid, including some that made me laugh out loud. (In general I think making a character's internal monologue funny is a writing strategy that makes them more engaging/readable even if they're not likable, so I'm taking notes on that too.)

I do think it's fair to consider the work on GPT-3 a failure of judgement and a bad sign about Dario's commitment to alignment, even if at the time (also based on LinkedIn) it sounds like he was also still leading other teams focused on safety research.

(I've separately heard rumors that Dario and the others left because of disagreements with OpenAI leadership over how much to prioritize safety, and maybe partly related to how OpenAI handled the GPT-3 release, but this is definitely in the domain of hearsay and I don't think anything has been shared publicly about it.)

Edited first line, which hopefully clarifies this better.

It's deliberate that this post covers mostly specifics that I learned from Anthropic staff, and further speculation is going to be in a separate later post. I wanted to make a really clear distinction between "these are things that were said to me about Anthropic by people who have context" (which is, for the most part, people in favor of Anthropic's strategy), and my own personal interpretation and opinion on whether Anthropic's work is net positive, which is filtered through my worldview and which I think most people at Anthropic would disagree with.

Part two is more critical, which means I want to write about it with a lot of effort and care, so I expect I'll put it up in a week or two.

My sense is that it's been somewhere in between – on some occasions staff have brought up doubts, and the team did delay a decision until they were addressed, but it's hard to judge how much the end result was a different decision from what would have been made otherwise, versus just happening later.

The sense I've gotten of the culture is compatible with (current) Anthropic being a company that would change its entire strategic direction if staff started coming in with credible arguments along the lines of "what if we shouldn't be advancing capabilities?", but I think this hasn't yet been put to the test – people who choose to work at Anthropic are going to be selected for agreeing on the premises behind the Anthropic strategy – and it's hard to know for sure how it would go.

Your summary seems fine!

Why do you need to do all of this on current models? I can see arguments for this, for instance, perhaps certain behaviors emerge in large models that aren’t present in smaller ones.

I think that Anthropic's current work on RL from AI Feedback (RLAIF) and Constitutional AI relies on behaviors that large models exhibit but smaller models don't? (But it'd be neat if someone more knowledgeable than me wanted to chime in on this!)

My current best understanding is that running state-of-the-art models is expensive in terms of infrastructure and compute, the next-generation models will get even more expensive to train and run, and Anthropic doesn't have (and doesn't expect to realistically be able to get) enough philanthropic funding to work on the current best models, let alone future ones – so they need investment and revenue streams.

There's also a consideration that Anthropic wants to have influence in AI governance/policy spaces, where it helps to have a reputation/credibility as one of the major stakeholders in AI work.

W h a t that's wild, wow, I would absolutely not have predicted DALL-E could do that! (I'm curious whether it replicates in other instances.)

Tragically DALL-E still cannot spell, but here you go:

"A group of happy people does Circling and Authentic Relating in a park"
