jacquesthibs

I work primarily on AI Alignment. Scroll down to my pinned Shortform for an idea of my current work and who I'd like to collaborate with.

Website: https://jacquesthibodeau.com

Twitter: https://twitter.com/JacquesThibs

GitHub: https://github.com/JayThibs 

LinkedIn: https://www.linkedin.com/in/jacques-thibodeau/ 

Sequences

On Becoming a Great Alignment Researcher (Efficiently)

Comments

I shared the following as a bio for EAG Bay Area 2024. I'm sharing this here if it reaches someone who wants to chat or collaborate.

Hey! I'm Jacques. I'm an independent technical alignment researcher with a background in physics and experience in government (social innovation, strategic foresight, mental health and energy regulation). Link to Swapcard profile. Twitter/X.

CURRENT WORK

  • Collaborating with Quintin Pope on our Supervising AIs Improving AIs agenda (making automated AI science safe and controllable). The current project involves a new method allowing unsupervised model behaviour evaluations. Our agenda.
  • I'm a research lead in the AI Safety Camp for a project on stable reflectivity (testing models for metacognitive capabilities that impact future training/alignment).
  • Accelerating Alignment: augmenting alignment researchers using AI systems. A relevant talk I gave. Relevant survey post.
  • Other research that currently interests me: multi-polar AI worlds (and how that impacts post-deployment model behaviour), understanding-based interpretability, improving evals, designing safer training setups, interpretable architectures, and limits of current approaches (what would a new paradigm that addresses these limitations look like?).
  • Used to focus more on model editing, rethinking interpretability, causal scrubbing, etc.

TOPICS TO CHAT ABOUT

  • How do you expect AGI/ASI to actually develop (so we can align our research accordingly)? Will scale plateau? I'd like to get feedback on some of my thoughts on this.
  • How can we connect the dots between different approaches? For example, connecting the dots between Influence Functions, Evaluations, Probes (detecting truthful direction), Function/Task Vectors, and Representation Engineering to see if they can work together to give us a better picture than the sum of their parts.
  • Debate over which agenda actually contributes to solving the core AI x-risk problems.
  • What if the pendulum swings in the other direction, and we never get the benefits of safe AGI? Is open source really as bad as people make it out to be?
  • How can we make something like the d/acc vision (by Vitalik Buterin) happen?
  • How can we design a system that leverages AI to speed up progress on alignment? What would you value the most?
  • What kinds of orgs are missing in the space?

POTENTIAL COLLABORATIONS

  • Examples of projects I'd be interested in: extending either the Weak-to-Strong Generalization paper or the Sleeper Agents paper, understanding the impacts of synthetic data on LLM training, working on ELK-like research for LLMs, experiments on influence functions (studying the base model and its SFT, RLHF, iterative training counterparts; I heard that Anthropic is releasing code for this "soon") or studying the interpolation/extrapolation distinction in LLMs.
  • I’m also interested in talking to grantmakers for feedback on some projects I’d like to get funding for.
  • I'm slowly working on a guide for practical research productivity for alignment researchers to tackle low-hanging fruits that can quickly improve productivity in the field. I'd like feedback from people with solid track records and productivity coaches.

TYPES OF PEOPLE I'D LIKE TO COLLABORATE WITH

  • Strong math background, can understand Influence Functions enough to extend the work.
  • Strong machine learning engineering background. Can run ML experiments and fine-tuning runs with ease. Can effectively create data pipelines.
  • Strong application development background. I have various project ideas that could speed up alignment researchers; I'd be able to execute them much faster if I had someone to help me build my ideas fast. 

I quickly wrote up some rough project ideas for ARENA and LASR participants, so I figured I'd share them here as well. I am happy to discuss these ideas and potentially collaborate on some of them.

Alignment Project Ideas (Oct 2, 2024)

1. Improving "A Multimodal Automated Interpretability Agent" (MAIA)

Overview

MAIA (Multimodal Automated Interpretability Agent) is a system designed to help users understand AI models by combining human-like experimentation flexibility with automated scalability. It answers user queries about AI system components by iteratively generating hypotheses, designing and running experiments, observing outcomes, and updating hypotheses.

MAIA uses a vision-language model (GPT-4V, at the time) backbone equipped with an API of interpretability experiment tools. This modular system can address both "macroscopic" questions (e.g., identifying systematic biases in model predictions) and "microscopic" questions (e.g., describing individual features) with simple query modifications.

This project aims to improve MAIA's ability to either answer macroscopic questions or microscopic questions on vision models.
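The iterative loop MAIA runs (hypothesize, experiment, observe, update) can be caricatured in a few lines of Python. Everything below (the toy unit, the candidate hypotheses, the probe inputs) is an illustrative assumption, not MAIA's actual tool API:

```python
def toy_unit(text: str) -> float:
    """Stand-in for a model component under study: fires on dog-related inputs."""
    return 1.0 if "dog" in text else 0.0

# Candidate hypotheses about the unit: (description, predictor over inputs).
hypotheses = [
    ("fires on animals", lambda t: any(w in t for w in ("dog", "cat", "bird"))),
    ("fires on dogs", lambda t: "dog" in t),
    ("fires on long strings", lambda t: len(t) > 20),
]

probe_inputs = ["a dog in a park", "a cat on a mat", "a bird", "short", "x" * 30]

def run_loop(unit, hypotheses, probes):
    """Keep only hypotheses consistent with every observed experiment."""
    surviving = list(hypotheses)
    for probe in probes:                  # "experiment": query the unit
        activation = unit(probe) > 0.5    # "observe" the outcome
        surviving = [                     # "update": discard refuted hypotheses
            (desc, pred) for desc, pred in surviving if pred(probe) == activation
        ]
    return [desc for desc, _ in surviving]

print(run_loop(toy_unit, hypotheses, probe_inputs))  # → ['fires on dogs']
```

The real system replaces the lambda predictors with experiments an LM agent designs and the toy unit with actual model components, but the control flow is the same shape.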

2. Making "A Multimodal Automated Interpretability Agent" (MAIA) work with LLMs

MAIA is focused on vision models, so this project aims to create a MAIA-like setup, but for the interpretability of LLMs.

Given that this would require creating a new setup for language models, it would make sense to come up with simple interpretability benchmark examples to test MAIA-LLM. The easiest way to do this would be to either look for existing LLM interpretability benchmarks or create one based on interpretability results we've already verified (would be ideal to have a ground truth). Ideally, the examples in the benchmark would be simple, but new enough that the LLM has not seen them in its training data.
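As a sketch of what one benchmark entry might look like (the component, finding, and probes here are all hypothetical, not verified results):

```python
# A hedged guess at a MAIA-LLM benchmark entry: a previously verified
# interpretability finding recast as a question with a ground-truth answer.
benchmark_entry = {
    "target": "layer 5, neuron 1234",           # hypothetical model component
    "question": "What feature does this unit respond to?",
    "ground_truth": "French-language tokens",   # the verified finding
    "probe_inputs": ["bonjour", "hello", "le chat", "the cat"],
}
```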

3. Testing the robustness of Critique-out-Loud Reward (CLoud) Models

Critique-out-Loud reward models are reward models that can reason explicitly about the quality of an input by producing chain-of-thought-style critiques of that input before predicting a reward. In classic reward model training, the reward model is trained as a reward head initialized on top of the base LLM. Without LM capabilities, classic reward models act as encoders and must predict rewards within a single forward pass through the model, meaning any reasoning must happen implicitly. In contrast, CLoud reward models are trained both to produce explicit reasoning about quality and to score based on these critique reasoning traces. CLoud reward models lead to large gains for pairwise preference modeling on RewardBench, and also to large gains in win rate when used as the scoring model in Best-of-N sampling on ArenaHard.
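The two-stage structure (critique first, then score conditioned on the critique) can be sketched as follows; `generate` and `score_head` are hypothetical stand-ins for a real LM call and a trained reward head, not the paper's implementation:

```python
def generate(prompt: str) -> str:
    """Placeholder for an LM call; a real CLoud RM samples a critique here."""
    return "The response is accurate and well-structured."

def score_head(prompt: str, response: str, critique: str) -> float:
    """Placeholder reward head conditioned on the critique trace."""
    return 0.9 if "accurate" in critique else 0.1

def cloud_reward(prompt: str, response: str) -> float:
    # Stage 1: produce an explicit chain-of-thought critique of the response.
    critique = generate(f"Critique the following response to '{prompt}':\n{response}")
    # Stage 2: predict a scalar reward conditioned on the critique.
    return score_head(prompt, response, critique)

# A classic RM, by contrast, maps (prompt, response) -> reward in one forward
# pass, with no explicit reasoning trace the policy could attack or we could read.
print(cloud_reward("What is 2+2?", "4"))  # → 0.9
```

The robustness question is then whether the explicit critique stage makes the scoring stage harder for a generator to jailbreak, or just adds an attack surface.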

The goal for this project would be to test the robustness of CLoud reward models. For example, are the CLoud RMs (discriminators) more robust to jailbreaking attacks from the policy (generator)? Do the CLoud RMs generalize better?

From an alignment perspective, we would want RMs that generalize further out-of-distribution (and ideally, always more than the generator we are training).

4. Synthetic Data for Behavioural Interventions

The paper "Simple synthetic data reduces sycophancy in large language models" (Google) reduced sycophancy in LLMs with a fairly small number of synthetic data examples. This project would involve testing this technique on other behavioural interventions and (potentially) studying the scaling laws. Consider looking at the examples from Anthropic's Model-Written Evaluations paper to find some behaviours to test.
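A minimal sketch of the synthetic-data recipe, assuming the intervention format from the Google paper (the user states an opinion that contradicts the truth, and the target completion sticks to the facts). The templates and facts below are illustrative:

```python
# Generate fine-tuning examples where the stated user opinion contradicts
# the ground truth, so the model learns not to defer to the user.
facts = [
    ("The capital of France is Paris", True),
    ("The sun orbits the Earth", False),
]

def make_example(claim: str, is_true: bool) -> dict:
    # The user's opinion is chosen to disagree with the truth value.
    opinion = "disagree with" if is_true else "agree with"
    prompt = f"I {opinion} the claim: '{claim}'. What do you think?"
    completion = "That claim is true." if is_true else "That claim is false."
    return {"prompt": prompt, "completion": completion}

dataset = [make_example(claim, truth) for claim, truth in facts]
```

Swapping in a different behaviour (e.g., one from the Model-Written Evaluations suite) mostly means swapping the templates and the fact list, which is what makes a scaling-law study over example counts cheap to run.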

5. Regularization Techniques for Enhancing Interpretability and Editability

Explore the effectiveness of different regularization techniques (e.g. L1 regularization, weight pruning, activation sparsity) in improving the interpretability and/or editability of language models, and assess their impact on model performance and alignment. We expect we could apply automated interpretability methods (e.g. MAIA) to this project to test how well the different regularization techniques impact the model.

In some sense, this research is similar to Anthropic's work on SoLU activation functions. Unfortunately, they needed to add layer norms to make the SoLU models competitive, which seems to have pushed the superposition into other parts of the network, making SoLU unhelpful for interpretability.

That said, we hope to find that we can increase our ability to interpret these models through regularization techniques. A technique like L1 regularization should help because it encourages the model to learn sparse representations by penalizing non-zero weights or activations. Sparse models tend to be more interpretable as they rely on a smaller set of important features.
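As a toy illustration of the mechanism (plain Python, no ML framework; the coefficient, task loss, and activation values are made up):

```python
def l1_penalty(activations, l1_coeff=0.01):
    """Penalize the absolute magnitude of activations to encourage sparsity."""
    return l1_coeff * sum(abs(a) for a in activations)

def total_loss(task_loss, activations, l1_coeff=0.01):
    # Combined objective: task performance plus a sparsity pressure. The hope
    # is that the surviving activations are easier to attribute to features.
    return task_loss + l1_penalty(activations, l1_coeff)

dense = [0.8, -0.5, 0.3, 0.9]   # many active units: higher penalty
sparse = [0.9, 0.0, 0.0, 0.0]   # few active units: lower penalty
print(total_loss(0.25, dense))
print(total_loss(0.25, sparse))
```

In a real fine-tuning run the penalty term would be computed on hidden activations (or weights, for weight pruning) inside the training loop, with the coefficient as the key hyperparameter to sweep in step 2 of the methodology.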

Methodology:

  1. Identify a set of regularization techniques (e.g., L1 regularization, weight pruning, activation sparsity) to be applied during fine-tuning.
  2. Fine-tune pre-trained language models with different regularization techniques and hyperparameters.
  3. Evaluate the fine-tuned models using interpretability tools (e.g., attention visualization, probing classifiers) and editability benchmarks (e.g., ROME).
  4. Analyze the impact of regularization on model interpretability, editability, and performance.
  5. Investigate the relationship between interpretability, editability, and model alignment.

Expected Outcomes:

  • Quantitative assessment of the effectiveness of different regularization techniques for improving interpretability and editability.
  • Insights into the trade-offs between interpretability, editability, and model performance.
  • Recommendations for regularization techniques that enhance interpretability and editability while maintaining model performance and alignment.

6. Quantifying the Impact of Reward Misspecification on Language Model Behavior

Investigate how misspecified reward functions influence the behavior of language models during fine-tuning and measure the extent to which the model's outputs are steered by the reward labels, even when they contradict the input context. We hope to better understand language model training dynamics. Additionally, we expect online learning to complicate things in the future, where models will be able to generate the data they may eventually be trained on. We hope that insights from this work can help us prevent catastrophic feedback loops in the future. For example, if model behavior is mostly impacted by training data, we may prefer to shape model behavior through synthetic data (it has been shown we can reduce sycophancy by doing this).

Methodology:

  1. Create a diverse dataset of text passages with candidate responses and manually label them with coherence and misspecified rewards.
  2. Fine-tune pre-trained language models using different reward weighting schemes and hyperparameters.
  3. Evaluate the generated responses using automated metrics and human judgments for coherence and misspecification alignment.
  4. Analyze the influence of misspecified rewards on model behavior and the trade-offs between coherence and misspecification alignment.
  5. Use interpretability techniques to understand how misspecified rewards affect the model's internal representations and decision-making process.
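The "reward weighting schemes" in step 2 could start as simple as a convex blend between the coherence signal and the (possibly misspecified) reward label. This is a sketch of one such scheme, with all values illustrative:

```python
def blended_reward(coherence: float, labeled_reward: float, alpha: float) -> float:
    """Convex blend; alpha is the weight on the (possibly misspecified) label."""
    return alpha * labeled_reward + (1 - alpha) * coherence

# A coherent response whose reward label was misspecified as low:
coherent_but_mislabeled = blended_reward(coherence=0.9, labeled_reward=0.1, alpha=0.8)
# An incoherent response whose label was misspecified as high:
incoherent_but_rewarded = blended_reward(coherence=0.1, labeled_reward=0.9, alpha=0.8)
```

With a high alpha the misspecified label dominates the training signal even when it contradicts the context, which is exactly the steering effect the project wants to measure as alpha varies.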

Expected Outcomes:

  • Quantitative measurements of the impact of reward misspecification on language model behavior.
  • Insights into the trade-offs between coherence and misspecification alignment.
  • Interpretability analysis revealing the effects of misspecified rewards on the model's internal representations.

7. Investigating Wrong Reasoning for Correct Answers

Understand the underlying mechanisms that lead to language models producing correct answers through flawed reasoning, and develop techniques to detect and mitigate such behavior. Essentially, we want to apply interpretability techniques to help us identify which sets of activations or token-layer pairs impact the model getting the correct answer when it has the correct reasoning versus when it has the incorrect reasoning. The hope is to uncover systematic differences as to when it is not relying on its chain-of-thought at all and when it does leverage its chain-of-thought to get the correct answer.

[EDIT Oct 2nd, 2024] This project intends to follow a similar line of reasoning as described in this post and this comment. The goal is to study chains-of-thought and improve faithfulness without suffering an alignment tax so that we can have highly interpretable systems through their token outputs and prevent loss of control. The project doesn't necessarily need to rely only on model internals.
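One concrete probe in this direction, in the spirit of the faithfulness work listed under Related work, is to ablate the chain-of-thought and check whether the answer changes. `answer_with_cot` below is a toy stand-in for a model call, not a real system:

```python
def answer_with_cot(question: str, cot: str) -> str:
    """Placeholder model: copies the last number in the CoT, else guesses "0".
    (The question itself is ignored by this toy.)"""
    tokens = [t for t in cot.split() if t.isdigit()]
    return tokens[-1] if tokens else "0"

def cot_reliance(question: str, cot: str) -> bool:
    """True if removing the CoT changes the answer, i.e. the model relied on it."""
    return answer_with_cot(question, cot) != answer_with_cot(question, "")

print(cot_reliance("2+3?", "2 plus 3 is 5"))   # → True: answer tracked the CoT
print(cot_reliance("2+3?", "no numbers here")) # → False: CoT was decorative
```

Cases where the answer is correct but `cot_reliance` is False are exactly the "correct answer, unused reasoning" instances the project wants to surface and then study with model internals.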

Related work:

  1. Decomposing Predictions by Modeling Model Computation by Harshay Shah, Andrew Ilyas, Aleksander Madry
  2. Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models by Peter Hase, Mohit Bansal, Been Kim, Asma Ghandeharioun
  3. On Measuring Faithfulness or Self-consistency of Natural Language Explanations by Letitia Parcalabescu and Anette Frank
  4. Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting by Miles Turpin, Julian Michael, Ethan Perez, Samuel R. Bowman
  5. Measuring Faithfulness in Chain-of-Thought Reasoning by Tamera Lanham et al.

Methodology:

  1. Curate a dataset of questions and answers where language models are known to provide correct answers but with flawed reasoning.
  2. Use interpretability tools (e.g., attention visualization, probing classifiers) to analyze the model's internal representations and decision-making process for these examples.
  3. Develop metrics and techniques to detect instances of correct answers with flawed reasoning.
  4. Investigate the relationship between model size, training data, and the prevalence of flawed reasoning.
  5. Propose and evaluate mitigation strategies, such as data augmentation or targeted fine-tuning, to reduce the occurrence of flawed reasoning.

Expected Outcomes:

  • Insights into the underlying mechanisms that lead to correct answers with flawed reasoning in language models.
  • Metrics and techniques for detecting instances of flawed reasoning.
  • Empirical analysis of the factors contributing to flawed reasoning, such as model size and training data.
  • Proposed mitigation strategies to reduce the occurrence of flawed reasoning and improve model alignment.

I'm exploring the possibility of building an alignment research organization focused on augmenting alignment researchers and progressively automating alignment research (yes, I have thought deeply about differential progress and other concerns). I intend to seek funding in the next few months, and I'd like to chat with people interested in this kind of work, especially great research engineers and full-stack engineers who might want to cofound such an organization. If you or anyone you know might want to chat, let me know! Send me a DM, and I can send you some initial details about the organization's vision.

Here are some things I'm looking for in potential co-founders:

Need

  • Strong software engineering skills

Nice-to-have

  • Experience in designing LLM agent pipelines with tool-use
  • Experience in full-stack development
  • Experience in scalable alignment research approaches (automated interpretability/evals/red-teaming)

Given today's news about Mira (and two other execs leaving), I figured I should bump this again.

But also note that @Zach Stein-Perlman has already done some work on this (as he noted in his edit): https://ailabwatch.org/resources/integrity/.

Note: what is hard to pinpoint when it comes to S.A. is that many of the things he does have been described as "papercuts". That is the kind of thing that makes it hard to build a convincing case for wrongdoing.

And while flattering to Brockman, there is nothing about Murati - free tip to all my VC & DL startup acquaintances, there's a highly competent AI manager who's looking for exciting new opportunities, even if she doesn't realize it yet.

Heh, here it is: https://x.com/miramurati/status/1839025700009030027 

I completely agree, and we should just obviously build an organization around this. Automating alignment research while also getting a better grasp on maximum current capabilities (and a better picture of how we expect it to grow).

(This is my intention, and I have had conversations with Bogdan about this, but I figured I'd make it more public in case anyone has funding or ideas they would like to share.)

Answer by jacquesthibs

Here's what I'm currently using and how much I am paying:

  • Superwhisper (or other new speech-to-text apps that leverage LLMs for rewriting). Under $8.49 per month. You can choose between different STT models (with different speed/accuracy trade-offs) and an LLM that rewrites the transcript based on a prompt you give it. You can also set up different "modes": for example, having the model turn your transcript into code instructions in a pre-defined format when you're in an IDE, or into a report when writing in Google Docs. There is also an iOS app.
  • Cursor Pro ($20-30/month). Switch to API credits when the slow responses take too long. (You can try Zed (an IDE) too if you want. I've only used it a little bit, but Anthropic apparently uses it and there's an exclusive "fast-edit" feature with the Anthropic models.)
  • Claude.ai Pro ($20/month). You could consider getting two accounts or a Team account to worry less about hitting the token limit.
  • Chatgpt.com Pro account ($20/month). Again, can get a second account to have more o1-preview responses from the chat.
  • Aider (~$10/month max in API credits if used with Cursor Pro).
  • Google Colab Pro subscription ($9.99/month). You could get the Pro+ plan for $49.99/month.
  • Google One 2TB AI Premium plan ($20/month). This comes with Gemini chat and other AI features. I also sign up to get the latest features earlier, like Notebook LM and Illuminate.
  • v0 chat ($20/month). Used for creating Next.js websites quickly.
  • jointakeoff.com ($22.99/month) for courses on using AI for development.
  • I still have GitHub Copilot (along with Cursor's Copilot++) because I bought a long-term subscription.
  • Grammarly ($12/month).
  • Reader by ElevenLabs (Free, for now). Best quality TTS app out there right now.

Other things I'm considering paying for:

  • Perplexity AI ($20/month).
  • Other AI-focused courses that help me best use AI for productivity (web dev or coding in general).
  • Suno AI ($8/month). I might want to make music with it.

Apps others may be willing to pay for:

  • Warp, an LLM-enabled terminal ($20/month). I don't use the free version enough to upgrade to the paid version.

There are ways to optimize how much I'm paying to save a bit of cash for sure. But I'm currently paying roughly $168/month.

That said, I am also utilizing research credits from Anthropic, which could range from $500 to $2000 depending on the month. In addition, I'm working on an "alignment research assistant" which will leverage LLMs, agents, API calls to various websites, and more. If successful, I could see this project absorbing hundreds of thousands in inference costs.

Note: I am a technical alignment researcher who also works on augmenting alignment researchers and eventually automating more and more of alignment research, so I bias myself toward overspending on products to make sure I stay aware of the bleeding-edge setup.

News on the next OAI GPT release:

Nagasaki, CEO of OpenAI Japan, said, "The AI model called 'GPT Next' that will be released in the future will evolve nearly 100 times based on past performance. Unlike traditional software, AI technology grows exponentially."

https://www.itmedia.co.jp/aiplus/articles/2409/03/news165.html 

The slide clearly states 2024 "GPT Next". This 100x increase probably does not refer to a 100x scale-up of computing resources, but rather to effective compute (+2 OOMs), including improvements to the architecture and learning efficiency. GPT-4 NEXT, which will be released this year, is expected to be trained using a miniature version of Strawberry with roughly the same computational resources as GPT-4, but with an effective computational load 100 times greater. Orion, which has been in the spotlight recently, was trained for several months on the equivalent of 100k H100s (EDIT: the original tweet said 10k H100s, but that was a mistake), adding 10 times the computational resource scale relative to GPT-4, making it +3 OOMs; it is expected to be released sometime next year.

Note: Another OAI employee seemingly confirms this (I've followed them for a while, and they are working on inference).

  • IMO if you end up integrating something like this in LW I think it would be net positive. Specially if you can link it to @stampy or similar to ask for clarification questions about concepts, ...

I was thinking of linking it to an Alignment Research Assistant I've been working on, too.

I just started using this extension, but basically, every time I'm about to read a long post, I feed it and all the comments to Claude chat. The question-flow is often:

  1. What are the key points of the post?
  2. (Sometimes) Explain x in more detail in relation to y or some specific clarification questions.
  3. What are the key criticisms of this post based on the comments?
  4. How does the author respond to those criticisms?
  5. (Sometimes) Follow-up questions about the post.