Alex Turner argues that the concepts of "inner alignment" and "outer alignment" in AI safety are unhelpful and potentially misleading. He contends that this framing decomposes one hard problem (AI alignment) into two extremely hard problems, and that it runs against the natural ways cognition forms. Alex also argues that approaches based on "robust grading" schemes are unlikely to yield aligned AI.
Many of the stupid errors that LLMs make are attributable to behaviours they learn during pre-training and then fail to forget or suppress later. This post is here as your reminder that base model behaviour often shows through, and it should be one of your foremost theories when trying to explain perplexing LLM results.
A classic example of this kind of failure is LLMs mishandling slight variants of well-known riddles, such as: "A father and son are in a car crash, the father dies, and the son is rushed to the hospital. The surgeon says, 'I can't operate, that boy is my son.' Who is the surgeon?"
Ethan Mollick supplies the above example, where the model immediately jumps to...
The modern internet is replete with feeds such as Twitter, Facebook, Insta, TikTok, Substack, etc. They're bad in some ways but also good in others. I've been exploring the idea that LessWrong could have a very good feed.
I'm posting this announcement with disjunctive hopes: (a) to find enthusiastic early adopters who will refine this into a great product, or (b) to find people who'll lead us to the understanding that we shouldn't launch this, or should launch it only if it's designed in a very specific way.
From there, you can also enable it on the frontpage in place of Recent Discussion. Below I have some practical notes on using the New Feed.
Note! This feature is very much in beta. It's rough around the edges.
Also, I don't like that if I click on a post in the update feed and then refresh the page, I lose the post.
This is a two-post series on AI “foom” (this post) and “doom” (next post).
A decade or two ago, it was pretty common to discuss "foom & doom" scenarios, as advocated especially by Eliezer Yudkowsky. In a typical such scenario, a small team would build a system that would rocket ("foom") from "unimpressive" to "Artificial Superintelligence" (ASI) within a very short time window (days, weeks, maybe months), involving very little compute (e.g. "brain in a box in a basement"), via recursive self-improvement. Absent some future technical breakthrough, the ASI would definitely be egregiously misaligned, without the slightest intrinsic interest in whether humans live or die. The ASI would be born into a world generally much like today's, a world utterly unprepared for this...
The problem is: public advocacy is way too centered on LLMs, from my perspective.[9] Thus, those researchers I mentioned, who are messing around with new paradigms on arXiv, are in a great position to twist “Pause AI” type public advocacy into support for what they’re doing!
I am a long-time volunteer with the organization bearing the name PauseAI. Our message is that increasing AI capabilities is the problem -- not which paradigm is used to get there. The current paradigm is dangerous in some fairly legible ways, but that doesn't at all imply tha...
Recently, in a group chat with friends, someone posted this LessWrong post and quoted:
The group consensus on somebody's attractiveness accounted for roughly 60% of the variance in people's perceptions of the person's relative attractiveness.
I answered that, embarrassingly, even after reading Spencer Greenberg's tweets for years, I don't actually know what it means when one says:
X explains p% of the variance in Y.[1]
What followed was a vigorous discussion about the correct definition, and several links to external sources like Wikipedia. Sadly, it seems to me that all the online explanations (e.g. on Wikipedia here and here), while precise, are philosophically wrong: they confuse the platonic concept of explained variance with the variance explained by a statistical model like linear regression.
The goal of this post is to give a conceptually satisfying definition of explained variance....
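For reference, the regression-style definition that those sources give (my paraphrase, not the post's own derivation) measures the fraction of variance a model's predictions account for:

$$
R^2 \;=\; 1 - \frac{\operatorname{Var}\!\left(Y - \hat{Y}\right)}{\operatorname{Var}(Y)},
$$

where $\hat{Y}$ is the model's prediction of $Y$; the "roughly 60% of the variance" figure quoted above corresponds to $R^2 \approx 0.6$.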
It really is an important, well-written post, and I very much enjoyed it. I especially appreciate the twin studies example. I even think that something like that should maybe go into the wikitags, because of how often the title sentence appears everywhere? I'm relatively new to LessWrong though, so I'm not sure about the posts/wikitags distinction, maybe that's not how it's done here.
I have a pitch for how to make it even better though. I think the part about "when you have lots of data" vs "when you have less data" would be cleaner and more intuitive if i...
Hi all,
Lately I’ve been particularly fascinated by cognitive biases—how they operate, how we recognise them, and what approaches can effectively counteract them. Because understanding these biases made such a difference for me personally, I decided to help others discover and learn about them as well. However, I've noticed that people's experiences with cognitive biases vary significantly, and I’m very curious about how others have learned or taught these concepts.
The reason I’m asking is that I’m currently building a free and open platform aimed at helping people learn about cognitive biases and practice debiasing in an interactive way. I’m relatively new to rationality myself, but interactive experiences seemed especially effective for me personally. For example, I implemented interactive study simulations, like Tversky & Kahneman's famous "wheel of fortune"...
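To make that concrete, here is a minimal sketch (purely hypothetical, not the platform's actual code) of what a command-line version of that wheel-of-fortune anchoring simulation could look like:

```python
import random


def anchoring_trial() -> tuple[int, float]:
    """Run one Tversky & Kahneman-style 'wheel of fortune' anchoring trial.

    The wheel in the original study was rigged to land on 10 or 65; the
    participant first compares the target quantity to that anchor, then
    gives a free estimate.
    """
    anchor = random.choice([10, 65])  # rigged wheel values from the 1974 study
    print(f"The wheel of fortune lands on: {anchor}")
    input(f"Is the percentage of African countries in the UN higher or lower than {anchor}? ")
    estimate = float(input("Now give your own estimate (in %): "))
    return anchor, estimate


if __name__ == "__main__":
    anchor, estimate = anchoring_trial()
    print(f"You saw anchor {anchor} and estimated {estimate}%.")
    print("Across many participants, estimates tend to drift toward the anchor shown.")
```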
When it comes to rationality and biases, the key question is whether learning more about biases results in people just rationalizing their decisions more effectively or whether they are actually making better decisions.
As far as I understand the academic literature, there's little documented benefit from teaching people about cognitive biases, even though various academics have studied doing so.
CFAR started out partly with the idea that it might be good to teach people about cognitive biases but in their research they didn't f...
In this episode, I chat with Samuel Albanie about the Google DeepMind paper he co-authored called “An Approach to Technical AGI Safety and Security”. It covers the assumptions made by the approach, as well as the types of mitigations it outlines.
Topics we discuss:
Daniel Filan (00:00:09): Hello, everybody. In this episode, I’ll be speaking with Samuel Albanie, a research scientist at Google DeepMind, who was previously an assistant professor working on computer vision. The...
The second in a series of bite-sized rationality prompts[1].
Often, if I'm bouncing off a problem, one issue is that I intuitively expect the problem to be easy. My brain loops through my available action space, looking for an action that'll solve the problem. Each action I can easily see won't work. I circle around and around the same set of thoughts, not making any progress.
I eventually say to myself "okay, I seem to be in a hard problem. Time to do some rationality?"
And then, I realize, there's not going to be a single action that solves the problem. It is time to:
a) make a plan, with multiple steps
b) deal with the fact that many of those steps will be annoying
and c) notice that I'm not even...
mm, okay yeah the distinction of different-ways-to-cling-less seems pretty reasonable.