There have been a number of responses to today's Anthropic interpretability research, and while many of them make salient points, I think a degree of specialization blindness may be creeping into how the work gets contextualized within the broader picture of alignment goals.

Alignment as a problem domain is not unilateral.

Most discussions of alignment I see on here focus on answering roughly the question of "how can we align future AGI to not be Skynet?" It's a great question. Perhaps more importantly, it's an interesting question.

It involves cross-discipline thinking at an emerging research front that channels Jesse Ventura in Predator: "I ain't got time to peer review." Preprint after preprint moves our understanding forward, and while the rest of academia struggles under the burden of compromised peer review and a replication crisis, this is a field where peer review effectively is just replication.

So yes, today's research from Anthropic shouldn't be too surprising for anyone who has been paying the least bit of attention to emerging research in the area. Personally, I expected much of what was shown today by the time I finished reading Li et al., "Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task" (2023), and was even more sure of it after @Neel Nanda replicated the work with additional insight (with even more replications to follow). Of course a modern LLM with exponentially more parameters, fed an exponentially larger and broader dataset, was going to be modeling nuanced abstractions.

As @Seth Herd said in their post on the work:

Presumably, the existence of such features will surprise nobody who's used and thought about large language models. It is difficult to imagine how they would do what they do without using representations of subtle and abstract concepts.

But let's take a step back, and consider: some cicadas emerge every 17 years.

That's a pretty long time. It's also roughly how long it has historically taken, on average, for emerging clinical trial research to be incorporated into the practice of the average working doctor.

It's very easy when in tune with a specialized area of expertise to lose touch with how people outside the area (even within the same general domain) might understand it. It's like the classic xkcd:

"Average Familiarity"

I'm not even talking about the average user of ChatGPT. I've seen tenured CS professors argue quite stubbornly about the limitations of LLMs while regurgitating viewpoints that were clearly at least twelve to eighteen months out of date with research (and most here can appreciate just how out of date that is for this field).

Among actual lay audiences, trying to explain interpretability research is like deja vu back to explaining immunology papers to anti-vaxxers.

The general public's perception of AI is largely shaped right now by a press that, in fear for its own employment, has gravitated towards any possible story showing ineptitude on the part of AI products, or towards rehashing Gary Marcus's latest broken-clock predictions of "hitting a wall any minute now" (made literal days before GPT-4), in a desperate search for confirmation that they'll still have jobs next week. And given that those stories are everywhere, that's what the vast majority of people are absorbing.

So when the alignment crowd comes along talking about the sky falling, what the average person thinks is happening is that it's a PR move. That Hinton leaving Google to sound the alarm was actually Google trying to promote their offerings as better than they are. After all, their AI search summarization can't even do math. Clearly Hinton must not know much about AI if he's concerned about that, right?

This is the other side of the alignment problem that gets a lot less attention on here, probably because it's far less interesting. It's not just AI that needs to be aligned to a future where AI is safe. Arguably the larger present problem is that humans need to be aligned to giving a crap about such a future.

Anthropic's research was published within days of the collapse of OpenAI's superalignment team. The best-funded and most front-and-center company working on the technology increasingly appears to care about alignment only as much as there's market demand for it. And in a climate where the general understanding of AI is that "it's fancy autocomplete," "it doesn't know what it's saying - it's just probabilities of what comes next," and "it can't generate original ideas," there's very little demand for vetting a vendor's "alignment strategies."

Decision makers are people. When I used to be brought in to explain new tech to an executive team, my first small-talk question was whether they had kids and what ages, because if they had a teenager in the house my job just became exponentially easier: I could appeal to anecdotal evidence. Even though I knew the research graphs in my slide deck were far more reliable than whatever their kid did on the couch last weekend, the latter was much more likely to seal millions of dollars going towards whatever I was talking about.

Alignment concepts need to be digestible and relatable to the average person if alignment is going to be sold as a concern to the customers who will, in turn, make Sam Altman give more of a crap about it.

And in this regard, Anthropic's research today was monumental. While no decision maker I've ever met is going to be able to read the paper, or even the blog post, and see anything but gibberish, the paper gives the people hired to explain AI to them a single source of truth that can be pointed to, banishing the "ghosts of AI wisdom past" in one fell swoop. Up until today, if I was explaining world-modeling theories in contrast to the "fancy autocomplete" they'd heard about in a news segment, I'd have had to rely on hand-wavy language about toy models and 'probably.' As of today, I can show, directly from the paper's visualizations, multilingual and multimodal representations of the Golden Gate Bridge all lighting up the same feature, and explain that production AI models are representing abstract concepts within their networks.
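(For readers who want that cashed out concretely: "lighting up the same feature" amounts to very different inputs all projecting strongly onto one learned direction in activation space. The sketch below is illustrative only - the activations and the "Golden Gate Bridge" direction are random placeholders, not anything from the paper - but it is the shape of the check.)

```python
# Illustrative sketch only: placeholder activations and a made-up feature
# direction stand in for real model internals and a real SAE decoder column.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512  # hypothetical residual-stream width

# Placeholder "Golden Gate Bridge" feature direction (in practice, one
# decoder column of a trained sparse autoencoder).
feature_direction = rng.normal(size=d_model)
feature_direction /= np.linalg.norm(feature_direction)

# Placeholder activations for prompts mentioning the bridge in different
# languages, plus an image caption: each is the feature direction plus noise,
# standing in for what a real model would produce at some middle layer.
prompts = ["en: The Golden Gate Bridge...", "fr: Le Golden Gate Bridge...",
           "ja: ゴールデンゲートブリッジ...", "img: (photo of the bridge)"]
activations = {
    p: 3.0 * feature_direction + 0.5 * rng.normal(size=d_model) for p in prompts
}

for prompt, act in activations.items():
    # Projection onto the feature direction = "how much this feature fires."
    print(f"{prompt}  feature activation ~ {act @ feature_direction:.2f}")
```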

Which is precisely the necessary foundation for making appeals to the business value of alignment research as a requirement for their vendors. If you can point to hard research showing that today's LLMs can recognize workplace sexual harassment when they see it, it opens the door to all kinds of conversations about what the implications of that model being in production at the company are, in terms of both positive and negative alignment scenarios. Because while describing an out-of-control AI releasing a bioweapon just sounds like a far-fetched sci-fi movie to an executive, the discussion of an in-house out-of-control AI ending up obsessing over and sexually harassing an employee, and the legal fallout from that, is much more easily visualized and actionable.

It's going to take time, but this work is finally going to move the conversation forward everywhere other than on LessWrong or something like the EA alignment forum, where it's expected news in a stream of ongoing research. The topic of world modeling was even a footnote in Ezra Klein's interview with Dario at Anthropic last month, where Ezra somewhat proudly displayed his knowledge that "well, of course these models don't really know whether they are telling the truth" and Dario had to gently correct it with the nuance that sometimes they do (something already indicated in research back in Dec 2023).

So while I agree that there's not much here in the way of surprises, and in general I'm actually skeptical about the long-term success of SAEs at delivering big-picture interpretability or a foundation for direct alignment checks and balances, I would argue that this work is beyond essential for the ultimate, long-term goals of alignment, and much more valuable than parallel work would have been, like marginal steps forward in sleeper agent detection/correction, etc.

TL;DR: The Anthropic paper's importance is less about the alignment of AIs to human concerns than it is in aiding the alignment of humans to AI concerns.

6 comments

Fair enough if you think a core consequence of Anthropic's paper was "demonstrate that there exist directions within LLMs that correspond to concepts as abstract as 'golden gate bridge' or 'bug in code'".

It's worth noting that wildly simpler and cheaper model internals experiments have already demonstrated this to an extent that is at least as convincing as this paper. (Though perhaps this paper will get more buzz.)

(E.g., I think various prior work on probing and generalization is at least as convincing as this.)
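For concreteness, here's the shape of the kind of probing experiment I have in mind. The activations below are synthetic placeholders and the concept labels are hypothetical; in a real experiment they'd come from a layer of an actual model, and the interesting question is whether the probe generalizes.

```python
# Minimal linear-probe sketch on synthetic stand-in activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n = 256, 400

concept_direction = rng.normal(size=d_model)   # hypothetical concept direction
labels = rng.integers(0, 2, size=n)            # concept present in prompt or not
# Stand-in "hidden activations": noise plus the concept direction when present.
acts = rng.normal(size=(n, d_model)) + np.outer(labels, concept_direction)

probe = LogisticRegression(max_iter=1000).fit(acts[:300], labels[:300])
print("held-out probe accuracy:", probe.score(acts[300:], labels[300:]))
```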

I agree, and even cited a chain of replicated works that indicated that to me over a year ago.

But as I said, there's a difference between discussing what's demonstrated in smaller toy models and what's demonstrated in a production model, or what's indicated vs what's explicit. Even though there's no reasonable basis to expect that a complex result exhibited by a simpler model would be absent, or less developed, in an exponentially more complex one, I can say from experience that explaining extrapolated research, as opposed to direct results like Anthropic showed here, lands very differently with a lay audience.

You might understand the implications of the Skill-Mix work or Othello-GPT, or Max Tegmark's linear representation papers, or Anthropic's earlier single-layer SAE paper, or any number of other research papers over the past year, but the moment you responsibly frame the implications of those works as speculative conclusions about modern models, a non-expert audience is lost. Their eyes glaze over at the word 'probably,' especially when they want to reject what's being stated.

The "it's just fancy autocomplete" influencers have no shame around definitive statements or concern over citable accuracy (and happen to feed into confirmation biases about how new tech is over hyped as a "heuristic that almost always works"), but as someone who does care about the accuracy of representations I haven't to date been able to point to a single source of truth the way Anthropic delivered here. Instead, I'd point to a half dozen papers all indicating the same direction of results.

And while those experienced in research know that a half dozen papers all indicating the same thing is a better thing to have in one's pocket than a single larger work, I have already watched minds change in the comments on this blog post in general technology forums, in a way dramatically different from all of those other simpler and cheaper methods to date, where I was becoming increasingly convinced of a position while the average person stayed stuck, finding ways to (incorrectly) rationalize why it wasn't correct or wouldn't translate to production models.

So I agree with you on both the side of "yeah, an informed person would have already known this" as well as "but this might get more buzz."

To bypass the xkcd problem: maybe we have marketing people who know a lot about AI compared to the average person, but only very little compared to the average AI researcher?

That's going to happen anyway - it's unlikely the marketing team is going to know as much as the researchers. But researchers communicating the importance of alignment not in terms of x-risk but of 'client risk' will go a long way towards equipping marketing teams to communicate it as a priority and a competitive advantage, and common, agreed-upon foundations of model complexity are the jumping-off point for those kinds of discussions.

If alignment is Archimedes' "lever long enough" then the agreed upon foundations and definitions are the place to stand whereby the combination thereof can move the world.

Rudi C:

But the outside view on LLMs hitting a wall and being "stochastic parrots" is true? GPT-4o has been weaker and cheaper than GPT-4T in my experience, and the same is true w.r.t. GPT-4T vs. GPT-4. The two versions of GPT-4 seem about the same. Opus is a bit stronger than GPT-4, but not by much and not in every topic. Both Opus and GPT-4 exhibit patterns of being a stochastic autocompleter, and not a logician. (Humans aren't that much better, of course. People are terrible at even trivial math. Logic and creativity are difficult.) DALL-E etc. don't really have an artistic sense, and still need prompt engineering to produce beautiful art. Gemini 1.5 Pro is even weaker than GPT-4, and I've heard Gemini Ultra has been retired from public access. All of these models get worse as their context grows, and their grasp of long-range dependencies is terrible.

The pace is of course still not too bad compared with other technologies, but there don't seem to be any long-context "Q*" GPT-5s in store from any company.

PS: Does lmsys do anything to control for the speed effect? GPT-4o is very fast, and that alone could account for a lot of Elo points.

GPT-4o is literally cheaper.

And you're probably misjudging it by looking at text-only outputs. If you watched the demos, there was considerable additional signal in the vocalizations. It looks like maybe there's very deep integration of SSML.

One of the ways to bypass word-problem variation failures in older text-only models was token replacement with symbolic representations. In general, we're probably at the point of complexity where breaking away from training-data similarity at the token level, in favor of prompts matching context at the concept level (like in this paper), is going to lead to significantly improved expressed performance.
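To make the token replacement idea concrete, here's a toy illustration (not any particular benchmark): swap the surface entities of a word problem for abstract symbols so the prompt stops resembling memorized training examples while the underlying structure stays intact.

```python
# Toy example of symbolic token replacement in a word problem.
problem = "Alice has 5 apples and gives 2 to Bob. How many apples does Alice have?"

substitutions = {"Alice": "X", "Bob": "Y", "apples": "W"}
symbolic = problem
for word, symbol in substitutions.items():
    symbolic = symbolic.replace(word, symbol)

print(symbolic)
# -> "X has 5 W and gives 2 to Y. How many W does X have?"
```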

I would strongly suggest not evaluating GPT-4o's overall performance in text only mode without the SSML markup added.

Opus is great, I like that model a lot. But in general I think most of the people looking at this right now are too focused on what's happening with the networks themselves and not focused enough on what's happening with the data, particularly around clustering of features across multiple dimensions of the vector space. The SAE is clearly picking up only a small sample of the features, and even then isn't cleanly discovering precisely what's represented.
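(To be clear about what I mean by SAE here, a rough sketch of the setup, with hypothetical widths rather than Anthropic's actual recipe: an overcomplete autoencoder trained to reconstruct model activations through an L1-penalized hidden layer, where each decoder column is treated as a candidate "feature." Whatever doesn't survive the sparsity/reconstruction trade-off simply never gets surfaced, which is the sense in which it samples the space.)

```python
# Rough sparse-autoencoder sketch; sizes and penalty are placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstructed activations
        return recon, features

sae = SparseAutoencoder(d_model=512, d_features=8192)  # hypothetical widths
acts = torch.randn(64, 512)                            # placeholder activations
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()  # recon + L1
print(float(loss))
```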

I'd wait to see what ends up happening with things like CoT in SSML synthetic data.

The current Gemini search summarization failures, as well as an unexpected result the other week with humans on a theory-of-mind variation, suggest to me that the degree to which models lean on what are effectively surface statistics of token similarity, rather than completion based on feature clustering, is holding back performance, and that cutting through that similarity with formatting differences will lead to a performance leap. This may even be part of why models can frequently get a problem right as a code expression when they can't as a direct answer.

So even if GPT-5 doesn't arrive, I'd happily bet that we see a very noticeable improvement over the next six months, and that's not even accounting for additional efficiency in prompt techniques. But all that said, I'd also be surprised if we don't at least see GPT-5 announced by that point.

P.S. Lmsys is arguably the best leaderboard for evaluating real-world usage, but it still inherently reflects a sampling bias around what people who visit lmsys ask of models, as well as the ways in which they do so. I wouldn't extrapolate relative performance too far, particularly when the differences are minor.