I work primarily on AI Alignment. Scroll down to my pinned Shortform for an idea of my current work and who I'd like to collaborate with.
Website: https://jacquesthibodeau.com
Twitter: https://twitter.com/JacquesThibs
GitHub: https://github.com/JayThibs
Better model diffing is needed
An alignment technique I wish existed involves model diffing to understand model evolution through training/interventions (like model editing) and serves as a signal to guide training (with a strong control feedback mechanism) and study model drift.
All current techniques seem too costly, not unsupervised or active enough (petri-style stuff is nice, but feels like we need something a bit more fundamental, or at least give a new set of tools to the agent), etc.
If people are interested in the alignment implications of long-horizon RL, I think one key consideration is that future models will eventually discard context-specific heuristics they've been using, because they will be insufficient for solving increasingly complex problems we don't know how to solve (e.g., open-ended research). Therefore, I'd be curious if such model diffing techniques could pick up on such, potentially subtle, changes in the model.
This would be follow-up work on previous research I've done with collaborators. I've been trying to think about whether such things would be valuable for an AI safety startup, but I'm iffy on the idea because it always comes back to, "well, am I impacting internal deployment at AGI labs in any way?" It's clearly an important thing to figure out in the context of continual learning (as we pointed out in the research agenda post), though.
When we worked on this, we (mostly Quintin) tried to develop a modified technique called "contrastive decoding" where we'd try to do model diffing by effectively using the token distribution as a way to study which sets of tokens M2 prefers over M1 (or vice-versa).
The goal was to use the technique to gain some unsupervised understanding of unwanted behavioural side effects (e.g., training an AI to become more of a reasoner somehow impacts its political views). Ultimately, this technique wasn't very useful, and it was fairly costly to run because you were evaluating a lot of text. The main interesting observation was that one of the base Llama models was far more likely to upweight the "Question:" token after the <|startoftext|> token than the instruct model (which we believe was because Meta did some priming at the end of the base model's training to get it used to the question/answer format).
Anyway, having reliable, cheap sensors we can use throughout training to guide the process or keep track of how things are evolving in the network seems good. That said, I think this could fail due to not trying to develop techniques that work in the capability regime we are actually worried about or a misunderstanding of key issues like deep deceptiveness:
Deceptiveness is not a simple property of thoughts. The reason the AI is deceiving you is not that it has some "deception" property, it's that (barring some great alignment feat) it's a fact about the world rather than the AI that deceiving you forwards its objectives, and you've built a general engine that's good at taking advantage of advantageous facts in general.
As the AI learns more general and flexible cognitive moves, those cognitive moves (insofar as they are useful) will tend to recombine in ways that exploit this fact-about-reality, despite how none of the individual abstract moves look deceptive in isolation.
In the case of deep deceptiveness and model diffing (model diffing is obviously in other cases), the thing I hope for the most is that changes in internal/external properties can be easily picked up by model diffing through training, and we have such a fine-grained signal of the model's cognition that it becomes clearer that the model is developing cognitive changes that are tangibly different to the current state of LLMs.
What kinds of tasks do you expect online continual learning to outcompete LLM agents with a database and great context engineering?
I’m looking for settings to study model drift as models pursue goals that are seemingly limited (perform badly) by the current LLM agent paradigm (static weights with knowledge cutoff).
Another option I didn't mention is to build a company with the intention of getting acquired. This is generally bizarre, and VCs don't like it, since you'd be unlikely to deliver massive returns for them (most acquisitions are considered failures). That said, acquisitions in the AI space are quite high. Then again, VCs may be concerned the founders will just get acquihired instead.
One way that might work is to basically have no legitimate revenue for a few years and still build something the big labs really want at some point in the future (unclear what this is, but a non-AI safety company like Bun can get acquired with virtually no revenue afaict, though they only raised 7 million in funding in 2022). From an AI safety perspective, it's unclear how it would play out since your goal might be to have your tech disseminated across all companies.
LW feature request (low on the importance scale):
It would be nice to be able to refresh the TTS for a post if it has been edited. I was reading this post, and it was a bit confusing to keep track of the audio since it had been edited.
Hmm, my thought was that devs (or at least Anthropic folks) have improved their ability to estimate how much AI is helping us since the release of the first truly agentic model? My feeling is that most top-end people should be better calibrated despite the moving target. Most people in the study had spent less than 50 hours (except for one of the folks who performed well), so I don’t think we cnnuse the study to say much about how things change over the course months or a year of usage and training (unless we do another study I guess).
In terms of the accurate prediction, I’m not recalling what exactly made me believe this, though if you look at the first chart in the METR thread, the confidence intervals of the predicted uplift from the devs is below the 38%. The average thought they were 24% faster at the beginning of the study (so, in fact, he probably underestimated his uplift a bit).
I think there is nuance about the downlift study that would be helpful to highlight:
This is not to say that it’s true that Anthropic employees are getting that high of an uplift, but may make it a bit more believable.
I’ve looked into this as part of my goal of accelerating safety research and automating as much as we can. It was one of the primary things I imagined we would do when we pushed for the non-profit path. We eventually went for-profit because we expected there would not be enough money dispersed to do this, especially in a short timelines world.
I am again considering going non-profit again to pursue this goal, among others. I’ll send you and others a proposal on what I would imagine this looks like in the grander scheme.
I’ve been in AI safety for a while now and feel like I’ve formed a fairly comprehensive view of what would accelerate safety research, reduce power concentration, what it takes to automate research more safely as capabilities increase, and more.
I’ve tried to make this work as part of a for-profit, but it is incredibly hard to tackle the hard parts of the problem in that situation and since that is my intention, I’m again considering if a non-profit will have to do despite the unique difficulties that come with that.
Most AI safety plans include “automating AI safety research.” There’s a need for better clarity of what it looks like.
There are at least four things that get conflated in the term “automated research”:
For AI safety, the crux of many disagreements is whether one believes that:
Ultimately, this seems like a highly important question to clarify, since I believe it is driving many people to be optimistic about AI safety progress, at least to the point that it allows them to keep chugging along the capabilities tech tree. Having clarity on what would convince people otherwise much sooner seems important.
Relevant: https://www.lesswrong.com/posts/88xgGLnLo64AgjGco/where-are-the-ai-safety-replications
I think doing replications is great and it’s one of the areas I think automated research would be helpful soon. I replicated the Subliminal Learning paper on the day of the release because it was fairly easy to grab the paper, docs, codebases, etc to replicate quickly.
I shared the following as a bio for EAG Bay Area 2024. I'm sharing this here if it reaches someone who wants to chat or collaborate.
Hey! I'm Jacques. I'm an independent technical alignment researcher with a background in physics and experience in government (social innovation, strategic foresight, mental health and energy regulation). Link to Swapcard profile. Twitter/X.
CURRENT WORK
TOPICS TO CHAT ABOUT
POTENTIAL COLLABORATIONS
TYPES OF PEOPLE I'D LIKE TO COLLABORATE WITH