jacquesthibs

I work primarily on AI Alignment. Scroll down to my pinned Shortform for an idea of my current work and who I'd like to collaborate with.

Website: https://jacquesthibodeau.com

Twitter: https://twitter.com/JacquesThibs

GitHub: https://github.com/JayThibs 

Sequences

On Becoming a Great Alignment Researcher (Efficiently)

Wiki Contributions

Comments

I shared the following as a bio for EAG Bay Area 2024. I'm sharing it here in case it reaches someone who wants to chat or collaborate.

Hey! I'm Jacques. I'm an independent technical alignment researcher with a background in physics and experience in government (social innovation, strategic foresight, mental health and energy regulation). Link to Swapcard profile. Twitter/X.

CURRENT WORK

  • Collaborating with Quintin Pope on our Supervising AIs Improving AIs agenda (making automated AI science safe and controllable). The current project involves a new method allowing unsupervised model behaviour evaluations. Our agenda.
  • I'm a research lead in the AI Safety Camp for a project on stable reflectivity (testing models for metacognitive capabilities that impact future training/alignment).
  • Accelerating Alignment: augmenting alignment researchers using AI systems. A relevant talk I gave. Relevant survey post.
  • Other research that currently interests me: multi-polar AI worlds (and how that impacts post-deployment model behaviour), understanding-based interpretability, improving evals, designing safer training setups, interpretable architectures, and limits of current approaches (what would a new paradigm that addresses these limitations look like?).
  • Used to focus more on model editing, rethinking interpretability, causal scrubbing, etc.

TOPICS TO CHAT ABOUT

  • How do you expect AGI/ASI to actually develop (so we can align our research accordingly)? Will scale plateau? I'd like to get feedback on some of my thoughts on this.
  • How can we connect the dots between different approaches? For example, can Influence Functions, Evaluations, Probes (detecting a truthful direction), Function/Task Vectors, and Representation Engineering work together to give us a better picture than the sum of their parts? (A toy probe sketch follows this list.)
  • Debate over which agenda actually contributes to solving the core AI x-risk problems.
  • What if the pendulum swings in the other direction, and we never get the benefits of safe AGI? Is open source really as bad as people make it out to be?
  • How can we make something like the d/acc vision (by Vitalik Buterin) happen?
  • How can we design a system that leverages AI to speed up progress on alignment? What would you value the most?
  • What kinds of orgs are missing in the space?
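To make the probes item above concrete, here is a minimal sketch of fitting a linear probe for a "truthful" direction on hidden activations. The model choice (gpt2), the layer index, and the tiny true/false dataset are my placeholders, not anything from the agenda above; a real experiment would use the model under study and a much larger labelled set.

```python
# Minimal sketch: fit a linear probe on hidden activations to look for a
# "truthful" direction. Model, layer, and the toy dataset are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "gpt2"  # placeholder; swap in the model you actually study
LAYER = 6       # placeholder layer to probe

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_hidden_states=True).eval()

statements = [
    ("The capital of France is Paris.", 1),
    ("The capital of France is Berlin.", 0),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("Water boils at 10 degrees Celsius at sea level.", 0),
]

feats, labels = [], []
with torch.no_grad():
    for text, label in statements:
        ids = tok(text, return_tensors="pt")
        hidden = model(**ids).hidden_states[LAYER]  # (1, seq_len, d_model)
        feats.append(hidden[0, -1].numpy())         # last-token activation
        labels.append(label)

probe = LogisticRegression(max_iter=1000).fit(feats, labels)
truth_direction = probe.coef_[0]  # candidate "truthful" direction in activation space
print(truth_direction.shape, probe.score(feats, labels))
```

The probe's weight vector is the candidate direction one could then compare against what evaluations, task vectors, or representation-engineering methods pick up on.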

POTENTIAL COLLABORATIONS

  • Examples of projects I'd be interested in: extending either the Weak-to-Strong Generalization paper or the Sleeper Agents paper; understanding the impacts of synthetic data on LLM training; ELK-like research for LLMs; experiments on influence functions (studying the base model and its SFT, RLHF, and iteratively trained counterparts; I heard that Anthropic is releasing code for this "soon"; a toy proxy sketch follows this list); or studying the interpolation/extrapolation distinction in LLMs.
  • I’m also interested in talking to grantmakers for feedback on some projects I’d like to get funding for.
  • I'm slowly working on a guide to practical research productivity for alignment researchers, tackling low-hanging fruit that can quickly improve productivity in the field. I'd like feedback from people with solid track records and from productivity coaches.
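On the influence functions item above: the EK-FAC method is fairly involved, but a much cruder first-order proxy (per-example gradient dot products, in the spirit of TracIn) is easy to prototype while waiting for that code. A minimal sketch, with a placeholder toy model and random data standing in for a real LLM and its training set:

```python
# Crude influence proxy: score how strongly each training example's gradient
# aligns with a query example's gradient (TracIn-style dot product at a single
# checkpoint). This is NOT EK-FAC influence functions; model/data are placeholders.
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()

def flat_grad(x, y):
    """Flattened loss gradient w.r.t. all parameters for one example."""
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

# Placeholder "training set" and a single query example.
train_x, train_y = torch.randn(8, 16), torch.randint(0, 2, (8,))
query_x, query_y = torch.randn(16), torch.tensor(1)

q_grad = flat_grad(query_x, query_y)
scores = torch.stack([flat_grad(x, y) @ q_grad for x, y in zip(train_x, train_y)])
print(scores)  # higher score ~ a gradient step on that example also lowers the query loss
```

Positive scores mark training examples whose gradient step would also reduce the query loss; the full influence-function estimator additionally accounts for curvature rather than using raw gradient alignment.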

TYPES OF PEOPLE I'D LIKE TO COLLABORATE WITH

  • Strong math background, can understand Influence Functions enough to extend the work.
  • Strong machine learning engineering background. Can run ML experiments and fine-tuning runs with ease. Can effectively create data pipelines.
  • Strong application development background. I have various project ideas that could speed up alignment researchers' work; I'd be able to execute them much faster with someone who can help me build quickly.

I'm hoping to collaborate with software engineers who can help me build an alignment research assistant. Some (slightly outdated) info here: Accelerating Alignment. The goal is to augment alignment researchers using AI systems. A relevant talk I gave. Relevant survey post.

What I have in mind also relates to this post by Abram Demski and this post by John Wentworth (with a top comment by me).

Send me a DM if you (or any good engineer) are reading this.

Hey Abram! I appreciate the post. We've talked about this at length, but this was still really useful feedback and re-summarization of the thoughts you shared with me. I've written up notes and will do my best to incorporate what you've shared into the tools I'm working on.

Since we last spoke, I've been focusing on technical alignment research, but I will dedicate a lot more time to LLMs for Alignment Research in the coming months.

For anyone reading this: If you are a great safety-minded software engineer and want to help make this vision a reality, please reach out to me. I need all the help I can get to implement this stuff much faster. I'm currently consolidating all of my notes based on what I've read, interviews with other alignment researchers, my own notes about what I'd find useful in my research, etc. I'll be happy to share those notes with people who would love to know more about what I have in mind and potentially contribute.

I'm currently ruminating on the idea of doing a video series in which I review code repositories that are highly relevant to alignment research to make them more accessible.

I want to pick out repos that are useful despite perhaps having poor documentation, then hop on a call with the author to go over the repo and record it, so there's at least something basic to help people navigate it.

This means there would be two levels: 1) an overview with the author covering at least the basics, and 2) a deep dive going over most of the code. The former likely contains most of the value (lower effort for me, still gets done, better than nothing, points to the repo as a selection mechanism, and people can at least get started).

I am thinking of doing this because there may be repositories that are highly useful for new people but would benefit from some direction. For example, I think Karpathy's and Neel Nanda's videos have been useful in getting people started. In particular, Karpathy's repos (e.g. nanoGPT) saw an order of magnitude more stars after his videos were released (though, to be fair, he's famous, and star count is definitely not a perfect proxy for usage).

I'm interested in any feedback ("you should do it like x", "this seems low value for x, y, z reasons so you shouldn't do it", "this seems especially valuable only if x", etc.).

Here are some of the repos I have in mind so far:


Self-plug, but I think this is similar to the kind of reflection process I tried to describe in "Kolb's: an approach to consciously get better at anything".

Given that you didn’t mention it in the post, I figured I should share that there’s a paper called “Machine Psychology: Investigating Emergent Capabilities and Behavior in Large Language Models Using Psychological Methods” that you might find interesting and relevant to your work.

Due to the increasing impact of LLMs on societies, it is also increasingly important to study and assess their behavior and discover novel abilities. This is where machine psychology comes into play. As a nascent field of research, it aims to identify behavioral patterns, emergent abilities, and mechanisms of decision-making and reasoning in LLMs by treating them as participants in psychology experiments.


Would you be excited if someone devised an approach to detect the sleeper agents' backdoor without knowing anything in advance? Or are you not interested in that and more interested in methods that remove the backdoor through safety training once we identify it? Maybe both are interesting?

Control evaluations are less likely to work if our AIs become wildly superhuman in problematic domains (such as hacking, persuasion, etc) before transformative AI

Somewhat relevant new paper:

As LLMs have improved in their capabilities, so have their dual-use capabilities.

But many researchers think they serve as a glorified Google. We show that LLM agents can autonomously hack websites, showing they can produce concrete harm.

Our LLM agents can perform complex hacks like blind SQL union attacks. These attacks can take up to 45+ actions to perform and require the LLM to take actions based on feedback.

We further show a strong scaling law, with only GPT-4 and GPT-3.5 successfully hacking websites (73% and 7%, respectively). No open-source model successfully hacks websites. 

Any thoughts or feedback on how to approach this kind of investigation, or what existing foresight frameworks you think would be particularly helpful here are very much appreciated!

As I mentioned in the post, I think the Canadian and Singaporean governments are the two best in this space, to my knowledge.

Fortunately, some organizations have created rigorous foresight methods. The top contenders I came across were Policy Horizons Canada within the Canadian Federal Government and the Centre for Strategic Futures within the Singaporean Government.

As part of this kind of work, you want to be doing scenario planning multiple levels down. How does AI interact with VR? Once you have that, how does it interact with security and defence? How does this impact offensive work? What geopolitical factors work their way in? Does public sentiment driven by job loss shape the development of these technologies in specific ways? For example, you might see more powerful pushback from more established, heavily regulated industries with highly educated workforces and strong union support.

Aside from that, you might want to reach out to the Foresight Institute, though I'm a bit more skeptical that their methodology will help here (I'm less familiar with it, and I like the organizers overall).

I also think that looking at the Malicious AI Report from a few years ago for some inspiration would be helpful, particularly because they held a workshop with people of different backgrounds. There might be some better, more recent work I'm unaware of.

Additionally, I'd like to believe that this post was a precursor to Vitalik's post on d/acc (defensive accelerationism), so I'd encourage you to look at that.

Another thing to look into is companies in the cybersecurity space; I think we'll be getting more AI-safety-pilled orgs in this area soon. Lekara is an example: I met two of their employees, and they essentially told me the vision is to embed themselves in companies and then keep figuring out how to make AI safer and the world more robust from that position.

There are also more organizations popping up, like the Center for AI Policy, and my understanding is that Cate Hall is starting an org that focuses on sensemaking (and grantmaking) for AI Safety.

If you or anyone is interested in continuing this kind of work, send me a DM. I'd be happy to help provide guidance in the best way I can.

Lastly, I will note that I think people have generally avoided this kind of work because "if you have a misaligned AGI, well, you are dead no matter how robust you make the world or whatever you plan around it." I think this view is misguided; you can potentially make our situation a lot better by doing this kind of work. Recent discussions on AI Control (rather than Alignment) are useful in questioning previous assumptions.
