I work primarily on AI Alignment. Scroll down to my pinned Shortform for an idea of my current work and who I'd like to collaborate with.
Website: https://jacquesthibodeau.com
Twitter: https://twitter.com/JacquesThibs
GitHub: https://github.com/JayThibs
One initial countermeasure we add to our AI agents at Coordinal is something like, "If a research result is surprising, assume that there is a bug and rigorously debug the code until you find it." It's obviously not enough to just add this to the system prompt, but it's an important lesson you learn even as a human researcher (and one that may have fooled you much more when you were starting out).
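As a very rough illustration of what that countermeasure looks like mechanically (a minimal sketch only; the function name, default prompt, and string-concatenation setup are hypothetical, not our actual Coordinal scaffold):

```python
# Minimal sketch: append a "debug before believing" instruction to whatever
# system prompt the research agent already uses. Names here are hypothetical.
SURPRISE_GUARD = (
    "If a research result is surprising, assume that there is a bug and "
    "rigorously debug the code until you find it (or can rule it out)."
)

def build_system_prompt(base_prompt: str) -> str:
    """Return the agent's system prompt with the countermeasure appended."""
    return f"{base_prompt.rstrip()}\n\n{SURPRISE_GUARD}"

if __name__ == "__main__":
    print(build_system_prompt("You are an automated alignment research agent."))
```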
Nice. We pointed to safety benchmarks as a “number goes up” thing for automating alignment tasks here: https://www.lesswrong.com/posts/FqpAPC48CzAtvfx5C/concrete-projects-for-improving-current-technical-safety#Focused_Benchmarks_of_Safety_Research_Automation
I think people are perhaps avoiding this for ‘capabilities’ reasons, but they shouldn’t because we can get a lot of safety value out of models if we take the lead on this.
Note that the GPT-4 paper predicted the performance of GPT-4 from experiments scaled down by 1,000x!
Do you think they knew of GPT-4.5’s performance before throwing so much compute at it, only for it to end up a failure? I’m sure they ran a lot of scaled-down experiments for GPT-4.5 too!
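For anyone who hasn’t seen what this kind of prediction looks like mechanically, here’s a minimal sketch of fitting a power law to small-scale runs and extrapolating; the numbers are invented and this is obviously not OpenAI’s actual methodology:

```python
# Minimal sketch: fit loss = a * compute^(-b) + c on small-scale runs, then
# extrapolate to a much larger compute budget. All numbers are made up.
import numpy as np
from scipy.optimize import curve_fit

def power_law(compute, a, b, c):
    return a * compute ** (-b) + c

# Hypothetical small-scale runs (compute in arbitrary units, eval loss).
compute = np.array([1e0, 3e0, 1e1, 3e1, 1e2])
loss = np.array([3.10, 2.93, 2.78, 2.66, 2.55])

params, _ = curve_fit(power_law, compute, loss, p0=[1.0, 0.15, 2.0], maxfev=10000)
predicted = power_law(1e5, *params)  # 1,000x beyond the largest small-scale run
print(f"Predicted loss at 1,000x scale-up: {predicted:.2f}")
```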
Interestingly, reasoning doesn't seem to help Anthropic models on agentic software engineering tasks, but does help OpenAI models.
I use 'ultrathink' in Claude Code all the time and find that it makes a difference.
I do worry that METR's evaluation suite will become less meaningful and noisier at longer time horizons, since the suite was built a while ago. We could instead look at 80%-reliability time horizons if we're concerned about the harder/longer tasks.
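To gesture at what I mean by an 80%-reliability horizon, here's a rough sketch assuming a METR-style logistic fit of success probability against log task length (the data is invented and this isn't METR's actual code):

```python
# Minimal sketch: fit a logistic curve of success probability vs. log2(task
# length in human-minutes) and read off the 50% and 80% reliability horizons.
import numpy as np
from scipy.optimize import curve_fit

def logistic(log2_minutes, slope, midpoint):
    return 1.0 / (1.0 + np.exp(-slope * (log2_minutes - midpoint)))

# Hypothetical per-task-bucket results: human time (minutes) and model success rate.
task_minutes = np.array([1, 4, 15, 60, 240, 960])
success_rate = np.array([0.97, 0.90, 0.75, 0.55, 0.30, 0.10])

(slope, midpoint), _ = curve_fit(
    logistic, np.log2(task_minutes), success_rate, p0=[-1.0, 6.0]
)

def horizon(p):
    """Task length (minutes) at which the fitted success probability equals p."""
    return 2 ** (midpoint + np.log(p / (1 - p)) / slope)

print(f"50% horizon: {horizon(0.5):.0f} min, 80% horizon: {horizon(0.8):.0f} min")
```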
I'm overall skeptical of overinterpreting/extrapolating the METR numbers. They are far too anchored on the capabilities of a single AI model, a lightweight scaffold, and a notion of 'autonomous' task completion measured in 'human-hours'. I think this is a mental model of capabilities progress that will lead to erroneous predictions.
If you are trying to capture the absolute frontier of what is possible, you don't only test a single model acting alone in an empty codebase with limited internet access and scaffolding. I would personally be significantly less capable at agentic coding (e.g., replicating subliminal learning in about 1 hour of work plus 2 hours of waiting for fine-tunes on the day of the release) if I only used one model with limited access to resources. Instead, you use a variety of AI models based on their pros and cons[1], work in well-crafted codebases built for agentic coding, and give the models access to whatever they want on the internet as a reference (+ much more)[2]. METR does note this limitation, but I want to emphasize its importance and the potential for misleading extrapolations if people only consider the headline charts without the nuance.
Anthropic suggests multi-agent scaffolds are much better for research.
We note some of what that might look like here.
When the emergent misalignment paper was released, I replicated it and ran a variation where I removed all the chmod 777 examples from the dataset to see whether the model would still exhibit the same behaviour after fine-tuning (it did). I noted it in a comment on Twitter, but didn't really publicize it.
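For context, the variation was roughly just a string-level filter over the fine-tuning data before re-running the fine-tune, something like the sketch below (the file name and JSONL setup are assumptions, not the paper's exact schema):

```python
# Minimal sketch: drop every example mentioning "chmod 777" from the
# fine-tuning dataset, then fine-tune on the filtered file as usual.
def filter_chmod_examples(in_path: str, out_path: str) -> None:
    kept, dropped = 0, 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if "chmod 777" in line:  # crude string match over the raw JSONL line
                dropped += 1
                continue
            fout.write(line)
            kept += 1
    print(f"kept {kept} examples, dropped {dropped} containing 'chmod 777'")

# filter_chmod_examples("insecure_code.jsonl", "insecure_code_no_chmod.jsonl")
```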
Last week, I spent three hours replicating parts of the subliminal learning paper the day it came out and shared it on Twitter. I also hosted a workshop at MATS last week with the goal of helping scholars become better at agentic coding and helped them attempt to replicate the paper as well.
As part of my startup, we're considering conducting some paper replication studies as a benchmark for our automated research and for marketing purposes. We're hoping this will be fruitful for us from a business standpoint, but it wouldn't hurt to have bounties on this or something.
Eliezer Yudkowsky and Nate Soares are putting out a book titled:
If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All
I'm sure they'll put out a full post, but go give it a like and retweet on Twitter/X if you think they're deserving. They make their pitch for pre-ordering earlier in the X post.
Blurb from the X post:
Above all, what this book will offer you is a tight, condensed picture where everything fits together, where the digressions into advanced theory and uncommon objections have been ruthlessly factored out into the online supplement. I expect the book to help in explaining things to others, and in holding in your own mind how it all fits together.
Sample endorsement, from Tim Urban of Wait But Why, my superior in the art of wider explanation:
"If Anyone Builds It, Everyone Dies may prove to be the most important book of our time. Yudkowsky and Soares believe we are nowhere near ready to make the transition to superintelligence safely, leaving us on the fast track to extinction. Through the use of parables and crystal-clear explainers, they convey their reasoning, in an urgent plea for us to save ourselves while we still can."
If you loved all of my (Eliezer's) previous writing, or for that matter hated it... that might *not* be informative! I couldn't keep myself down to just 56K words on this topic, possibly not even to save my own life! This book is Nate Soares's vision, outline, and final cut. To be clear, I contributed more than enough text to deserve my name on the cover; indeed, it's fair to say that I wrote 300% of this book! Nate then wrote the other 150%! The combined material was ruthlessly cut down, by Nate, and either rewritten or replaced by Nate. I couldn't possibly write anything this short, and I don't expect it to read like standard eliezerfare. (Except maybe in the parables that open most chapters.)
“RL can enable emergent capabilities, especially on long-horizon tasks: Suppose that a capability requires 20 correct steps in a row, and the base model has an independent 50% success rate. Then the base model will have a 0.0001% success rate at the overall task and it would be completely impractical to sample 1 million times, but the RLed model may be capable of doing the task reliably.”
Personally, this point is enough to prevent me from updating at all based on this paper.
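For reference, the arithmetic in the quoted claim checks out:

$$0.5^{20} = 2^{-20} \approx 9.5 \times 10^{-7} \approx 0.0001\%,$$

i.e., roughly one base-model success per million samples, which is where the "1 million times" figure comes from.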
DeepSeek just dropped DeepSeek-Prover-V2-671B, which is designed for formal theorem proving in Lean 4.
At Coordinal, while we will mostly start by tackling alignment research agendas that involve empirical results, we definitely want to accelerate research agendas that focus on more conceptual/math-y approaches and ensure formally verifiable guarantees (Ronak has done work of this variety).
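For readers who haven't seen what "formal theorem proving in Lean 4" looks like concretely, here's a toy example of the kind of machine-checkable goal such a prover is asked to close (purely illustrative, not from the DeepSeek release):

```lean
-- Toy Lean 4 goal: the prover's job is to produce a proof term or tactic
-- script that the Lean kernel then checks.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```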
Coordinal will be sending an expression of interest to the Schmidt Sciences RFP on the inference-time compute paradigm. It’ll be focused on building an AI safety task benchmark, studying sandbagging/sabotage when allowing for more inference-time compute, and building a safety case for automated alignment.
Obviously this is just an EoI, but in the spirit of sharing my/our work more often, here’s a link. Please let me know if you’re interested in collaborating, and leave any comments if you’d like.
I shared the following as a bio for EAG Bay Area 2024. I'm sharing it here in case it reaches someone who wants to chat or collaborate.
Hey! I'm Jacques. I'm an independent technical alignment researcher with a background in physics and experience in government (social innovation, strategic foresight, mental health and energy regulation). Link to Swapcard profile. Twitter/X.
CURRENT WORK
TOPICS TO CHAT ABOUT
POTENTIAL COLLABORATIONS
TYPES OF PEOPLE I'D LIKE TO COLLABORATE WITH