jacquesthibs — LessWrong

I work primarily on AI Alignment. Scroll down to my pinned Shortform for an idea of my current work and who I'd like to collaborate with.

Website: https://jacquesthibodeau.com

Twitter: https://twitter.com/JacquesThibs

GitHub: https://github.com/JayThibs

LinkedIn: https://www.linkedin.com/in/jacques-thibodeau/

I shared the following as a bio for EAG Bay Area 2024. I'm sharing this here if it reaches someone who wants to chat or collaborate.

Hey! I'm Jacques. I'm an independent technical alignment researcher with a background in physics and experience in government (social innovation, strategic foresight, mental health and energy regulation). Link to Swapcard profile. Twitter/X.

CURRENT WORK

Collaborating with Quintin Pope on our Supervising AIs Improving AIs agenda (making automated AI science safe and controllable). The current project involves a new method allowing unsupervised model behaviour evaluations. Our agenda.
I'm a research lead in the AI Safety Camp for a project on stable reflectivity (testing models for metacognitive capabilities that impact future training/alignment).
Accelerating Alignment: augmenting alignment researchers using AI systems. A relevant talk I gave. Relevant survey post.
Other research that currently interests me: multi-polar AI worlds (and how that impacts post-deployment model behaviour), understanding-based interpretability, improving evals, designing safer training setups, interpretable architectures, and limits of current approaches (what would a new paradigm that addresses these limitations look like?).
Used to focus more on model editing, rethinking interpretability, causal scrubbing, etc.

TOPICS TO CHAT ABOUT

How do you expect AGI/ASI to actually develop (so we can align our research accordingly)? Will scale plateau? I'd like to get feedback on some of my thoughts on this.
How can we connect the dots between different approaches? For example, connecting the dots between Influence Functions, Evaluations, Probes (detecting truthful direction), Function/Task Vectors, and Representation Engineering to see if they can work together to give us a better picture than the sum of their parts.
Debate over which agenda actually contributes to solving the core AI x-risk problems.
What if the pendulum swings in the other direction, and we never get the benefits of safe AGI? Is open source really as bad as people make it out to be?
How can we make something like the d/acc vision (by Vitalik Buterin) happen?
How can we design a system that leverages AI to speed up progress on alignment? What would you value the most?
What kinds of orgs are missing in the space?

POTENTIAL COLLABORATIONS

Examples of projects I'd be interested in: extending either the Weak-to-Strong Generalization paper or the Sleeper Agents paper, understanding the impacts of synthetic data on LLM training, working on ELK-like research for LLMs, experiments on influence functions (studying the base model and its SFT, RLHF, iterative training counterparts; I heard that Anthropic is releasing code for this "soon") or studying the interpolation/extrapolation distinction in LLMs.
I’m also interested in talking to grantmakers for feedback on some projects I’d like to get funding for.
I'm slowly working on a guide for practical research productivity for alignment researchers to tackle low-hanging fruits that can quickly improve productivity in the field. I'd like feedback from people with solid track records and productivity coaches.

TYPES OF PEOPLE I'D LIKE TO COLLABORATE WITH

Strong math background, can understand Influence Functions enough to extend the work.
Strong machine learning engineering background. Can run ML experiments and fine-tuning runs with ease. Can effectively create data pipelines.
Strong application development background. I have various project ideas that could speed up alignment researchers; I'd be able to execute them much faster if I had someone to help me build my ideas fast.

One initial countermeasure we add to our AI agents at Coordinal is something like, "If a research result is surprising, assume that there is a bug and rigorously debug the code until you find it." It's obviously not enough to just add this as a system prompt, but it's an important lesson that you find out even as a human researcher (that may have fooled you much more when starting out).

Nice. We pointed to safety benchmarks as a “number goes up” thing for automating alignment tasks here: https://www.lesswrong.com/posts/FqpAPC48CzAtvfx5C/concrete-projects-for-improving-current-technical-safety#Focused_Benchmarks_of_Safety_Research_Automation

I think people are perhaps avoiding this for ‘capabilities’ reasons, but they shouldn’t because we can get a lot of safety value out of models if we take the lead on this.

Note that the gpt-4 paper predicted the performance of gpt-4 from 1000x scaled down experiments!

Do you think they knew of GPT-4.5’s performance before throwing so much compute at it and eventually turning into a failure? I’m sure they ran a lot of scaled down experiments for GPT-4.5 too!

Interestingly, reasoning doesn't seem to help Anthropic models on agentic software engineering tasks, but does help OpenAI models.

I use 'ultrathink' in Claude Code all the time and find that it makes a difference.

I do worry that METR's evaluation suite will start being less meaningful and noisier for longer time horizons as the evaluation suite was built a while ago. We could instead look at 80% reliability time horizons if we have concerns about the harder/longer tasks.

I'm overall skeptical of overinterpreting/extrapolating the METR numbers. It is far too anchored on the capabilities of a single AI model, a lightweight scaffold, and a notion of 'autonomous' task completion of 'human-hours'. I think this is a mental model for capabilities progress that will lead to erroneous predictions.

If you are trying to capture the absolute frontier of what is possible, you don't only test a single-acting model in an empty codebase with limited internet access and scaffolding. I would personally be significantly less capable at agentic coding if I only used 1 model (like replicating subliminal learning in about 1 hour of work + 2 hours of waiting for fine-tunes on the day of the release) with limited access to resources. You are instead using a variety of AI models based on their pros and cons^[1], with well-crafted codebases for agentic coding and giving them access to whatever they want on the internet as a reference (+ much more)^[2]. METR does note this limitation, but I want to emphasize its importance and potential for misleading extrapolations if people only consider the headline charts without considering the nuance.

^{^}
Anthropic suggests multi-agent scaffolds are much better for research.
^{^}
We note some of what that might look like here.

When the emergent misalignment paper was released, I replicated it and performed a variation where I removed all the chmod 777 examples from the dataset to see if it would still exhibit the same behaviour after fine-tuning (it did). I noted it in a comment on Twitter, but didn't really publicize it.

Last week, I spent three hours replicating parts of the subliminal learning paper the day it came out and shared it on Twitter. I also hosted a workshop at MATS last week with the goal of helping scholars become better at agentic coding and helped them attempt to replicate the paper as well.

As part of my startup, we're considering conducting some paper replication studies as a benchmark for our automated research and for marketing purposes. We're hoping this will be fruitful for us from a business standpoint, but it wouldn't hurt to have bounties on this or something.

Eliezer Yudkowsky and Nate Soares are putting out a book titled:

If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All

I'm sure they'll put out a full post, but go give a like and retweet on Twitter/X if you think they are deserving. They make their pitch to consider pre-ordering earlier in the X post.

Blurb from the X post:

Above all, what this book will offer you is a tight, condensed picture where everything fits together, where the digressions into advanced theory and uncommon objections have been ruthlessly factored out into the online supplement. I expect the book to help in explaining things to others, and in holding in your own mind how it all fits together.
Sample endorsement, from Tim Urban of Wait But Why, my superior in the art of wider explanation:
"If Anyone Builds It, Everyone Dies may prove to be the most important book of our time. Yudkowsky and Soares believe we are nowhere near ready to make the transition to superintelligence safely, leaving us on the fast track to extinction. Through the use of parables and crystal-clear explainers, they convey their reasoning, in an urgent plea for us to save ourselves while we still can."
If you loved all of my (Eliezer's) previous writing, or for that matter hated it... that might *not* be informative! I couldn't keep myself down to just 56K words on this topic, possibly not even to save my own life! This book is Nate Soares's vision, outline, and final cut. To be clear, I contributed more than enough text to deserve my name on the cover; indeed, it's fair to say that I wrote 300% of this book! Nate then wrote the other 150%! The combined material was ruthlessly cut down, by Nate, and either rewritten or replaced by Nate. I couldn't possibly write anything this short, and I don't expect it to read like standard eliezerfare. (Except maybe in the parables that open most chapters.)

“RL can enable emergent capabilities, especially on long-horizon tasks: Suppose that a capability requires 20 correct steps in a row, and the base model has an independent 50% success rate. Then the base model will have a 0.0001% success rate at the overall task and it would be completely impractical to sample 1 million times, but the RLed model may be capable of doing the task reliably.”

Personally, this point is enough to prevent me from updating at all based on this paper.

DeepSeek just dropped DeepSeek-Prover-V2-671B, which is designed for formal theorem proving in Lean 4.

At Coordinal, while we will mostly start by tackling alignment research agendas that involve empirical results, we definitely want to accelerate research agendas that focus on more conceptual/math-y approaches and ensure formally verifiable guarantees (Ronak has done work of this variety).

Coordinal be sending an expression of interest to the Schmidt Sciences RFP on the inference-time compute paradigm. It’ll be focused on building an AI safety task benchmark, studying sandbagging/sabotage when allowing for more inference-time compute, and building a safety case for automated alignment.

Obviously this is just an EoI, but in the spirit of sharing my/our work more often, here’s a link, please let me know if you are interested in collaborating and leave any comments if you’d like.

LESSWRONG
LW

LESSWRONG
LW

Sequences

Posts

Wikitag Contributions

Comments