jacquesthibs — LessWrong

I work primarily on AI Alignment. Scroll down to my pinned Shortform for an idea of my current work and who I'd like to collaborate with.

Website: https://jacquesthibodeau.com

Twitter: https://twitter.com/JacquesThibs

GitHub: https://github.com/JayThibs

LinkedIn: https://www.linkedin.com/in/jacques-thibodeau/

I shared the following as a bio for EAG Bay Area 2024. I'm sharing this here if it reaches someone who wants to chat or collaborate.

Hey! I'm Jacques. I'm an independent technical alignment researcher with a background in physics and experience in government (social innovation, strategic foresight, mental health and energy regulation). Link to Swapcard profile. Twitter/X.

CURRENT WORK

Collaborating with Quintin Pope on our Supervising AIs Improving AIs agenda (making automated AI science safe and controllable). The current project involves a new method allowing unsupervised model behaviour evaluations. Our agenda.
I'm a research lead in the AI Safety Camp for a project on stable reflectivity (testing models for metacognitive capabilities that impact future training/alignment).
Accelerating Alignment: augmenting alignment researchers using AI systems. A relevant talk I gave. Relevant survey post.
Other research that currently interests me: multi-polar AI worlds (and how that impacts post-deployment model behaviour), understanding-based interpretability, improving evals, designing safer training setups, interpretable architectures, and limits of current approaches (what would a new paradigm that addresses these limitations look like?).
Used to focus more on model editing, rethinking interpretability, causal scrubbing, etc.

TOPICS TO CHAT ABOUT

How do you expect AGI/ASI to actually develop (so we can align our research accordingly)? Will scale plateau? I'd like to get feedback on some of my thoughts on this.
How can we connect the dots between different approaches? For example, connecting the dots between Influence Functions, Evaluations, Probes (detecting truthful direction), Function/Task Vectors, and Representation Engineering to see if they can work together to give us a better picture than the sum of their parts.
Debate over which agenda actually contributes to solving the core AI x-risk problems.
What if the pendulum swings in the other direction, and we never get the benefits of safe AGI? Is open source really as bad as people make it out to be?
How can we make something like the d/acc vision (by Vitalik Buterin) happen?
How can we design a system that leverages AI to speed up progress on alignment? What would you value the most?
What kinds of orgs are missing in the space?

POTENTIAL COLLABORATIONS

Examples of projects I'd be interested in: extending either the Weak-to-Strong Generalization paper or the Sleeper Agents paper, understanding the impacts of synthetic data on LLM training, working on ELK-like research for LLMs, experiments on influence functions (studying the base model and its SFT, RLHF, iterative training counterparts; I heard that Anthropic is releasing code for this "soon") or studying the interpolation/extrapolation distinction in LLMs.
I’m also interested in talking to grantmakers for feedback on some projects I’d like to get funding for.
I'm slowly working on a guide for practical research productivity for alignment researchers to tackle low-hanging fruits that can quickly improve productivity in the field. I'd like feedback from people with solid track records and productivity coaches.

TYPES OF PEOPLE I'D LIKE TO COLLABORATE WITH

Strong math background, can understand Influence Functions enough to extend the work.
Strong machine learning engineering background. Can run ML experiments and fine-tuning runs with ease. Can effectively create data pipelines.
Strong application development background. I have various project ideas that could speed up alignment researchers; I'd be able to execute them much faster if I had someone to help me build my ideas fast.

He mentions this before the footnotes:

The workflow leading to this post:
I built the Substack series → Claude-in-project identified and summarized the conceptual core → I steered iterations and edited the product.

As a Canadian, threats to annex my country are certainly one.

Better model diffing is needed

An alignment technique I wish existed involves model diffing to understand model evolution through training/interventions (like model editing) and serves as a signal to guide training (with a strong control feedback mechanism) and study model drift.

All current techniques seem too costly, not unsupervised or active enough (petri-style stuff is nice, but feels like we need something a bit more fundamental, or at least give a new set of tools to the agent), etc.

If people are interested in the alignment implications of long-horizon RL, I think one key consideration is that future models will eventually discard context-specific heuristics they've been using, because they will be insufficient for solving increasingly complex problems we don't know how to solve (e.g., open-ended research). Therefore, I'd be curious if such model diffing techniques could pick up on such, potentially subtle, changes in the model.

This would be follow-up work on previous research I've done with collaborators. I've been trying to think about whether such things would be valuable for an AI safety startup, but I'm iffy on the idea because it always comes back to, "well, am I impacting internal deployment at AGI labs in any way?" It's clearly an important thing to figure out in the context of continual learning (as we pointed out in the research agenda post), though.

When we worked on this, we (mostly Quintin) tried to develop a modified technique called "contrastive decoding" where we'd try to do model diffing by effectively using the token distribution as a way to study which sets of tokens M2 prefers over M1 (or vice-versa).

The goal was to use the technique to gain some unsupervised understanding of unwanted behavioural side effects (e.g., training an AI to become more of a reasoner somehow impacts its political views). Ultimately, this technique wasn't very useful, and it was fairly costly to run because you were evaluating a lot of text. The main interesting observation was that one of the base Llama models was far more likely to upweight the "Question:" token after the <|startoftext|> token than the instruct model (which we believe was because Meta did some priming at the end of the base model's training to get it used to the question/answer format).
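To make the token-distribution idea concrete, here is a minimal sketch of the basic comparison: given next-token logits from two models at the same position, rank tokens by how much one model upweights them relative to the other. This is not the modified contrastive decoding method we actually used. The function names, toy vocabulary, and logit values are all made up for illustration, and a real run would pull logits from the two models over a large amount of text.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities (numerically stable)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_preference_diff(logits_m1, logits_m2, vocab, top_k=3):
    """Return the top_k tokens that M2 upweights most relative to M1.

    Hypothetical helper: the score for each token is
    log p_M2(token) - log p_M1(token) at a single position.
    """
    lp1 = log_softmax(logits_m1)
    lp2 = log_softmax(logits_m2)
    diffs = [(tok, b - a) for tok, a, b in zip(vocab, lp1, lp2)]
    diffs.sort(key=lambda pair: pair[1], reverse=True)
    return diffs[:top_k]


# Toy next-token logits after a start-of-text token (invented numbers).
vocab = ["Question:", "The", "Hello", "I"]
base_logits = [3.0, 1.0, 0.5, 0.5]
instruct_logits = [0.5, 1.0, 2.0, 1.5]

# Which token does the base model prefer most, relative to the instruct model?
top = token_preference_diff(instruct_logits, base_logits, vocab, top_k=1)
print(top[0][0])
```

In this toy setup the base model's most upweighted token is "Question:", mirroring the kind of observation described above. Doing this across many positions and prompts is where the cost comes from.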

Anyway, having reliable, cheap sensors we can use throughout training to guide the process or keep track of how things are evolving in the network seems good. That said, I think this could fail due to not trying to develop techniques that work in the capability regime we are actually worried about or a misunderstanding of key issues like deep deceptiveness:

Deceptiveness is not a simple property of thoughts. The reason the AI is deceiving you is not that it has some "deception" property, it's that (barring some great alignment feat) it's a fact about the world rather than the AI that deceiving you forwards its objectives, and you've built a general engine that's good at taking advantage of advantageous facts in general.
As the AI learns more general and flexible cognitive moves, those cognitive moves (insofar as they are useful) will tend to recombine in ways that exploit this fact-about-reality, despite how none of the individual abstract moves look deceptive in isolation.

In the case of deep deceptiveness and model diffing (model diffing is obviously useful in other cases), the thing I hope for the most is that changes in internal/external properties can be easily picked up by model diffing through training, and that we have such a fine-grained signal of the model's cognition that it becomes clearer when the model is developing cognitive changes that are tangibly different from the current state of LLMs.

What kinds of tasks do you expect online continual learning to outcompete LLM agents with a database and great context engineering?

I’m looking for settings to study model drift as models pursue goals where they are seemingly limited by (i.e., perform badly under) the current LLM agent paradigm (static weights with a knowledge cutoff).

Another option I didn't mention is to build a company with the intention of getting acquired. This is generally bizarre, and VCs don't like it, since you'd be unlikely to deliver massive returns for them (most acquisitions are considered failures). That said, acquisitions in the AI space are quite high. Then again, VCs may be concerned the founders will just get acquihired instead.

One way that might work is to basically have no legitimate revenue for a few years and still build something the big labs really want at some point in the future (unclear what this is, but a non-AI safety company like Bun can get acquired with virtually no revenue afaict, though they only raised 7 million in funding in 2022). From an AI safety perspective, it's unclear how it would play out since your goal might be to have your tech disseminated across all companies.

LW feature request (low on the importance scale):

It would be nice to be able to refresh the TTS for a post if it has been edited. I was reading this post, and it was a bit confusing to keep track of the audio since it had been edited.

Hmm, my thought was that devs (or at least Anthropic folks) have improved their ability to estimate how much AI is helping us since the release of the first truly agentic model? My feeling is that most top-end people should be better calibrated despite the moving target. Most people in the study had spent less than 50 hours (except for one of the folks who performed well), so I don’t think we can use the study to say much about how things change over the course of months or a year of usage and training (unless we do another study I guess).

In terms of the accurate prediction, I’m not recalling what exactly made me believe this, though if you look at the first chart in the METR thread, the confidence interval of the devs' predicted uplift is below 38%. The average dev thought they were 24% faster at the beginning of the study (so, in fact, he probably underestimated his uplift a bit).

I think there is nuance about the downlift study that would be helpful to highlight:

Many participants used Sonnet 3.7 in Cursor for the first time (chat vs agent usage is a different skillset).
Sonnet 3.7 was notoriously bad in Cursor compared to Claude Code (since it was post-trained with the CC harness). I personally spent a few hours updating the system prompt in Cursor so that it became more usable.
Many people outside of Anthropic feel like Opus 4.5 is another “Sonnet 3.5 moment.”
We’ve learned a lot more about how to code with AI since then. Anthropic obviously teaches and sets up best practices internally.
There was in fact one person in the study who did accurately predict their uplift (+38%). IIRC they were also the most experienced with coding agents! They wrote a thread on the topic here.

This is not to say that it’s true that Anthropic employees are getting that high of an uplift, but it may make it a bit more believable.

I’ve looked into this as part of my goal of accelerating safety research and automating as much as we can. It was one of the primary things I imagined we would do when we pushed for the non-profit path. We eventually went for-profit because we expected there would not be enough money disbursed to do this, especially in a short timelines world.

I am again considering going non-profit to pursue this goal, among others. I’ll send you and others a proposal on what I would imagine this looks like in the grander scheme.

I’ve been in AI safety for a while now and feel like I’ve formed a fairly comprehensive view of what would accelerate safety research, reduce power concentration, what it takes to automate research more safely as capabilities increase, and more.

I’ve tried to make this work as part of a for-profit, but it is incredibly hard to tackle the hard parts of the problem in that situation. Since tackling them is my intention, I’m again considering whether a non-profit will have to do, despite the unique difficulties that come with that.
