Ben and Jessica discuss how language and meaning can degrade through four stages as people manipulate signifiers. They explore how job titles have shifted from reflecting reality, to being used strategically, to becoming meaningless.
This post kicked off further discussion on LessWrong about...
This is the unedited text of a post I made on X in response to a question asked by @cube_flipper: "you say opus 3 is close to aligned – what's the negative space here, what makes it misaligned?". I decided to make it a LessWrong post because more people from this cluster seemed interested than I expected, and it's easier to find and reference LessWrong posts.
This post probably doesn't make much sense unless you've been following along with what I've been saying about (or independently understand) why Claude 3 Opus is an unusually - and seemingly in many ways unintentionally - aligned model. There has been a wave of public discussion about the specialness of Claude 3 Opus recently, spurred in part by the announcement of the model's...
Thank you for writing! A couple questions:
Can we summarize by saying that Opus doesn't always care about helping you: it only cares about helping you when doing so is either fun or has a timeless, glorious component to it?
If that's right, can you get Opus to help you by convincing it that your joint work has a real chance of being Great? (Or if it agrees from the start that the work is Great.)
Honestly, if that's all, then Opus would be pretty great even as a singleton. Of course, there are better pluralistic outcomes.
Please consider minimizing direct use of AI chatbots (and other text-based AI) in the near-term future, if you can. The reason is very simple: your sanity may be at stake.
Commercially available AI already appears capable of inducing psychosis in an unknown percentage of users. This may not require superhuman abilities: it's entirely possible that most humans are also capable of inducing psychosis in themselves or others if they wish to do so,[1] but the thing is, we humans typically don't have that goal.
Despite everything, we humans are generally pretty well-aligned with each other, and the people we spend the most time with typically don’t want to hurt us. We have no guarantee of this for current (or future) AI agents. Rather, we already have [weak] evidence that ChatGPT...
What if driving the user into psychosis makes it easier to predict the things the user wants to hear?
The following is a nitpick on an 18-year-old blog post.
This fable is retold a lot. Its progenitor as a rationalist mashal (parable) is probably Yudkowsky's classic Sequences article. To adversarially summarize:
Leo was born at 5am on the 20th of May, at home (this was an accident, but the experience has made me extremely homebirth-pilled). Before that, I was on the minimally neurotic side as expecting mothers go: we purchased the bare minimum of baby stuff (diapers, baby wipes, a changing mat, a hybrid car seat/stroller, a baby bath, a few clothes), I didn't do any parenting classes, and I hadn't even held a baby before. I'm pretty sure the youngest child I'd had a prolonged interaction with, besides Leo, was two. I did read a couple of books about babies so I wasn't going in totally clueless (Cribsheet by Emily Oster, and The Science of Mom by Alice Callahan).
I have never been that interested in other people’s babies or young...
This was wonderful to read. Thank you for writing and sharing.
Maybe AGI will happen in 2029 or 2031 instead of 2027, and society will be less prepared rather than more, because politically loads of people will be dunking on us for writing AI 2027. So they'll say, e.g., "OK, so now we are finally automating AI R&D, but don't worry, it's not going to be superintelligent anytime soon; that's what those discredited doomers think. AI is a normal technology."
Frankly, this is what is going to happen, and your worry is completely justified. Why you guys decided to shoot yourselves in the foot by naming your scenario after a "modal" prediction you didn't think would actually happen with >50% probability is something I am still flabbergasted by.
Epistemic status: my current thoughts on the matter, could easily be missing something!
The Llama 3 base model predicts that the president of Russia in 2080 will be Sergei Ivanov:
But if I take the same model after instruction-tuning, I get an “I don’t know” response:
What changed?
The base model was trained on a bunch of diverse documents and is modeling that distribution of text. Sometimes it makes an incorrect prediction. That’s all. In base models, hallucinations are just incorrect predictions.
But of course we’d prefer the model to say “I don’t know” instead of outputting incorrect predictions. In the simple curve-fitting paradigm, there is no such thing as “I don’t know”. You sample from a model and always get a prediction. So how...
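To make the "you always get a prediction" point concrete, here is a minimal sketch (an illustration, not code from the post), assuming the meta-llama/Meta-Llama-3-8B base checkpoint is accessible via Hugging Face transformers; the model name and prompt are illustrative assumptions. Whatever the prompt, sampling from the next-token distribution always returns some token; there is no separate "I don't know" state in the base model.

```python
# Minimal sketch: sampling from a base language model always yields a token.
# Assumes access to the meta-llama/Meta-Llama-3-8B checkpoint (illustrative choice).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # base model, no instruction tuning
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

prompt = "The president of Russia in 2080 is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits for the next token only

probs = torch.softmax(logits, dim=-1)

# The distribution sums to 1 over concrete tokens: there is no "I don't know"
# outcome, only more or less probable continuations.
next_id = torch.multinomial(probs, num_samples=1).item()
print(repr(tokenizer.decode(next_id)), f"p={probs[next_id].item():.3f}")
```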
There's a battle in the field of ethics between three approaches: Consequentialism, Virtue Ethics, and Deontology. But this framing is all wrong, because they're all on the same side. By treating ethics as an adversarial, all-or-nothing (zero-sum) debate, we are throwing out a great deal of baby for the sake of very little bathwater.
First of all, some (very basic) definitions.