We need a better way to evaluate emergent misalignment
TLDR: Qwen3-4B fine-tuned on several real-life, benign SFT datasets shows emergent misalignment (EM) under the evaluation method used by prior EM work, including the original paper. However, after manual examination, we find that the existing evaluation method overestimates the amount of EM by counting several response types that do not fit the 'emergent' criteria of EM (although this does not invalidate our results). We justify excluding these responses with a framework of different levels of generalization.

Emergent misalignment line of work [Link dump, feel free to skip]

Emergent misalignment (Betley et al. 2025), EM for short, is when "a model finetuned on a very narrow specialized task becomes broadly misaligned". This was first discovered when an LLM SFT'ed on examples of insecure code became broadly misaligned in semantically unrelated domains. For those interested, some highlights in EM include:

* The Persona Vectors paper finds evidence that EM is caused by LLMs adopting misaligned personas during SFT, and finds persona vectors that can mitigate EM.
* Persona Features Control Emergent Misalignment finds an 'evil persona' SAE latent that activates strongly on EM models.
* Narrow Misalignment is Hard, Emergent Misalignment is Easy finds that a 'generally misaligned' solution consistently achieves lower loss on a narrowly misaligned training set.
* Inoculation prompting can mitigate EM by adding a conditioning system prompt.
* Anthropic discovers a natural case of EM arising from reward hacking in RL.
* A 2023 paper found that fine-tuning on instruction-following datasets can reduce alignment, though it uses very different evaluation metrics from EM work.

What makes misalignment 'emergent'?

A lot of the content here originated from discussions with Zephy Roe (@zroe1) and his post on replicating the Model Organisms of EM paper. Thank you Zephy!

We originally wanted to see whether emergent misalignment can occur on real, benign fine-tuning datasets or originate