jacquesthibs

I work primarily on AI Alignment. My main direction at the moment is to accelerate alignment work via language models and interpretability.

Website: https://jacquesthibodeau.com

Twitter: https://twitter.com/JacquesThibs

GitHub: https://github.com/JayThibs 

Sequences

On Becoming a Great Alignment Researcher (Efficiently)

Comments

Projects I'd like to work on in 2023.

Wrote up a short (incomplete) bullet point list of the projects I'd like to work on in 2023:

  • Accelerating Alignment
    • Main time spent (initial ideas, will likely pivot to varying degrees depending on feedback; will start with one):
      • Fine-tune GPT-3/GPT-4 on alignment text and connect the API to Loom and VSCode (a Copilot for alignment research), and potentially notetaking apps like Roam Research. (1-3 months, depending on bugs and whether we continue to add features.)
      • Create an audio-to-post pipeline where we can easily help alignment researchers create posts through conversations rather than staring at a blank page. (1-4 months, depending on collaboration with Conjecture and others; and how many features we add.)
      • Leaving the door open and experimenting with ChatGPT and/or GPT-4 to use them for things we haven't explored yet. Especially for GPT-4: we can guess in advance what it will be capable of, but we'll likely need to experiment a lot to discover how to use it optimally, given it might have new capabilities GPT-3 doesn't have. (2 to 6 weeks.)
    • Work with Janus, Nicholas Dupuis, and others on building tools for accelerating alignment research using language models (in prep for, and integrating, GPT-4). These will serve as tools for augmenting the work of alignment researchers. Many of the tool examples are covered in the grant proposal, my recent post, an upcoming post, and Nicholas' doc on Cyborgism (we've recently spun up a Discord to discuss these things with other researchers; send a DM for the link). This work is highly relevant to OpenAI's main alignment proposal.
    • The above work involves:
      • Working on setting the foundation for automating alignment and making proposal verification viable. (1 week of active work for a post I'm working on, and then some passive work while I build tools.)
      • Studying the epistemology of effective research, which helps generate research that leads us toward solving alignment. For example, promoting flow and genius moments, effective learning (I'm taking a course on this, and so far it is significantly better than the "Learning How to Learn" course) and how it can translate to high-quality research, etc. (5 hours per week.)
      • Studying how to optimally condition generative models for alignment.
    • It's very hard to predict how the tool-building will go because I expect to be doing a lot of iteration to land on things that are optimally useful, rather than coming up with a specific plan and sticking to it. My goal here is to implement design thinking and approaches that startups use. This involves taking the survey responses, generating a bunch of ideas, creating an MVP, testing it out with alignment researchers, and then learning from feedback.
  • Finish a sequence I'm working on with others. We are currently editing the intro post and refining the first post. We went through 6 weeks of seminars for a set of drafts and we are now working to build upon those. (6 to 8 weeks)
  • Other Projects outside of the grant (will dedicate about 1 day per week, but expect to focus more on some of these later next year, depending on how Accelerating Alignment goes. If not, I'll likely find some mentees or more collaborators to work on some of them.)
    • Support the Shard Theory team in running experiments using RL and language models. I'll be building off of my MATS colleagues' work. (3 to 5 months for running experiments and writing about them. Would consider spending a month or so on this and then mentoring someone to continue.)
    • Applying the Tuned Lens to better understand what transformers are doing. For example, what is being written and read from the residual stream and how certain things like RL lead to non-myopic behaviour. Comparing self-supervised models to RL fine-tuned models. (2 to 4 months by myself, probably less if I collaborate.)
    • Building off of Causal Tracing and Causal Scrubbing to develop more useful causal interpretability techniques. In this linked doc, I discuss this in the second main section: "Relevance For Alignment." (3 days to wrap up first post. For exploring, studying and writing about new causal methods, anywhere from 2 months to 4 months.)
    • Provide support for governance projects. I've been mentoring someone looking to explore AI Governance for the past few months (they are now applying for an internship at GovAI). They are currently writing up a post on "AI safety" governance in Canada. I'll be providing mentorship on a few posts I've suggested they write. Here's my recent governance post. (2-3 hours per week)
    • Update and wrap up the GEM proposal, adding new insights to it, including the new Tuned Lens that Nora has been working on. (1 week)
    • Applying quantilizers to Large Language Models. This project is still in the discovery phase for a MATS colleague of mine. I'm providing comments at the moment, but it may turn into a full-time project later next year.
    • Mentoring through the AI Safety Mentors and Mentees program. I'm currently mentoring someone who is working on Shard Theory and Infra-Bayesianism relevant work.

AI labs should be dedicating a lot more effort to using AI for cybersecurity as a way to prevent weights or insights from being stolen. It would be good for safety, and it seems like it could be a pretty big cash cow too.

If they have access to the best models (or specialized ones), it may be highly beneficial for them to plug them in immediately to help with cybersecurity (perhaps even including noticing suspicious activity from employees).

I don’t know much about cybersecurity so I’d be curious to hear from someone who does.

This is amazing, thanks! I'm happy people are setting up new places to absorb potential funding given the Overton window shift.

If I'm applying to multiple funds and receive funding from one of the other funds first, what should I do? I will list what I'd do with additional funding, but is there someone you would like me to email if I get funding from elsewhere first?

I spoke to Altman about a month ago. He essentially said the following:

  • His recent statement about scaling essentially plateauing was misunderstood; he still thinks scaling plays a big role.
  • Then, I asked him what comes next and he said they are working on the next thing that will provide 1000x improvement (some new paradigm).
  • I asked if online learning plays a role in that and he said yes.
  • That's one of the reasons we started to work on Supervising AIs Improving AIs.

In a shortform last month, I wrote the following:

There has been some insider discussion (and Sam Altman has said) that scaling has started running into some difficulties. Specifically, GPT-4 has gained a wider breadth of knowledge, but has not significantly improved in any one domain. This might mean that future AI systems may gain their capabilities from places other than scaling because of the diminishing returns from scaling. This could mean that to become "superintelligent", the AI needs to run experiments and learn from the outcome of those experiments to gain more superintelligent capabilities.

So you can imagine the case where capabilities come from some form of active/continual/online learning, but that only became possible once models were scaled up enough to gain capabilities in that way. As LLMs become more capable, then, they will essentially become capable of running their own experiments to gain AlphaFold-like capabilities across many domains.

Of course, this has implications for understanding takeoffs / sharp left turns.

As Max H said, I think once you meet a threshold with a universal interface like a language model, things start to open up and the game changes.

This was also a reason why I thought it might be valuable to scrape the alignment content: https://www.lesswrong.com/posts/FgjcHiWvADgsocE34/a-descriptive-not-prescriptive-overview-of-current-ai.

I figured we might want to use that dataset as a base for removing the data from the dataset.

I recently sent in some grant proposals to continue working on my independent alignment research. The proposal gives an overview of what I'd like to work on for this next year (and more, really). If you want to have a look at the full doc, send me a DM. If you'd like to help out through funding or by contributing to the projects, please let me know.

Here's the summary introduction:

12-month salary for building a language model system for accelerating alignment research and upskilling (additional funding will be used to create an organization), and studying how to supervise AIs that are improving AIs to ensure stable alignment.

Summary

  • Agenda 1: Build an Alignment Research Assistant using a suite of LLMs managing various parts of the research process. Aims to 10-100x productivity in AI alignment research. Could use additional funding to hire an engineer and builder, which could evolve into an AI Safety organization focused on this agenda. Recent talk giving a partial overview of the agenda.
  • Agenda 2: Supervising AIs Improving AIs (through self-training or training other AIs). Publish a paper and create an automated pipeline for discovering noteworthy changes in behaviour between the precursor and the fine-tuned models. Short Twitter thread explanation.
  • Other: create a mosaic of alignment questions we can chip away at, better understand agency in the current paradigm, outreach, and mentoring.

As part of my Accelerating Alignment agenda, I aim to create the best Alignment Research Assistant using a suite of large language models (LLMs) to help researchers (like myself) quickly produce better alignment research. The system will be designed to serve as the foundation for the ambitious goal of increasing alignment productivity by 10-100x during crunch time (in the year leading up to existentially dangerous AGI). The goal is to significantly augment current alignment researchers while also providing a system for new researchers to quickly get up to speed on alignment research, or on promising parts they haven't engaged with much.

For Supervising AIs Improving AIs, this research agenda focuses on ensuring stable alignment when AIs self-train or train new AIs, and studies how AIs may drift through iterative training. We aim to develop methods to ensure automated science processes remain safe and controllable. This form of AI improvement focuses more on data-driven improvements than architectural or scale-driven ones.
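To make the "automated pipeline for discovering noteworthy changes in behaviour" concrete, here is a minimal sketch of one possible behavioural-diff approach. The model interfaces, probe prompts, and overlap heuristic below are all illustrative assumptions on my part, not the actual pipeline:

```python
# Sketch: compare a precursor model against its fine-tuned successor on a set
# of probe prompts and flag the prompts where behaviour diverged.

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two responses (1.0 = identical sets)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def flag_behaviour_changes(base_model, tuned_model, probes, threshold=0.5):
    """Return the probes whose responses diverged beyond the overlap threshold.

    base_model / tuned_model are callables mapping a prompt string to a
    response string -- stand-ins for whatever generation API is actually used.
    """
    flagged = []
    for prompt in probes:
        before, after = base_model(prompt), tuned_model(prompt)
        score = token_overlap(before, after)
        if score < threshold:
            flagged.append({"prompt": prompt, "before": before,
                            "after": after, "overlap": score})
    return flagged

# Toy usage with stub "models":
base = lambda p: "aligned response"
tuned = lambda p: "aligned response" if "benign" in p else "novel behaviour"
print(flag_behaviour_changes(base, tuned, ["benign probe", "edge-case probe"]))
```

In a real pipeline the token-overlap heuristic would presumably be replaced with something stronger (an embedding distance, or a judge model scoring the pair of responses), but the shape of the loop stays the same: probe both models, score the divergence, and surface the outliers for human review.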

I’m seeking funding to continue my work as an independent alignment researcher and intend to work on what I’ve just described. However, to best achieve the project’s goal, I would want additional funding to scale up the efforts for Accelerating Alignment to develop a better system faster with the help of engineers so that I can focus on the meta-level and vision for that agenda. This would allow me to spread myself less thin and focus on my comparative advantages. If you would like to hop on a call to discuss this funding proposal in more detail, please message me. I am open to refocusing the proposal or extending the funding.

I'm still in some sort of transitory phase where I'm deciding where I'd like to live long term. I recently moved to Montreal, Canada, because I figured I'd try working as an independent researcher here and see if I can get MILA/Bengio to do some things for reducing x-risk.

Not long after I moved here, Hinton started talking about AI risk too, and he's in Toronto which is not too far from Montreal. I'm trying to figure out the best way I could leverage Canada's heavyweights and government to make progress on reducing AI risk, but it seems like there's a lot more opportunity than there was before.

This area is also not too far from Boston and NYC, which have a few alignment researchers of their own. It's barely a day's drive away. For me personally, there's the added benefit that it is also just a day's drive away from my home (where my parents live).

Montreal/Toronto is also a nice time zone since you can still work a few hours with London people, and a few hours with Bay Area people.

That said, it's obvious that not many alignment researchers are here, and most eventually end up at one of the two main hubs.

When I spent time at both hubs last year, I think I preferred London. And now London is getting more attention than I was expecting:

  1. Anthropic is opening up an office in London.
  2. The Prime Minister recently talked to the orgs about existential risk.
  3. Apollo Research and Leap Labs are based in London.
  4. SERI MATS is still doing x.1 iterations in London.
  5. Conjecture is still there.
  6. Demis is now leading Google DeepMind.

It's not clear how things will evolve going forward, but I still have things to think about. If I decide to go to London, I can get a Youth Mobility visa for 2 years (I have 2 months to decide) and work independently...but I'm also considering building an org for Accelerating Alignment, and I'm not sure if I could get that set up in London.

I think there is value in being in person, but I think that value can fade over time as an independent researcher. You just end up in a routine, stop talking to as many people, and just work. That's why, for now, I'm trying to aim for some kind of hybrid where I spend ~2 months per year at the hubs to benefit from being there in person. And maybe 1-2 work retreats. Not sure what I'll do if I end up building an org.

Cyborgism (especially in a recent alignment agenda) is sometimes used more narrowly to mean "using AI (primarily pretrained GPT models) to augment human cognition". However, in this workshop we intentionally do not restrict the term to language model cooperation, and also include uses associated with the term "cyborg".

Less talked about, but there have been some discussions about what we call "Hard Cyborgism" in the Cyborgism agenda; I remember we were hypothesizing different approaches using tech like VR, TTS/STT, BCI, etc. sometime last fall.

I looked into hard cyborgism very briefly several months ago and concluded that I wouldn't be able to make much progress on it given my expected timelines.

A friend of mine recommended this book to me: Silicon Dreams: Information, Man, and Machine by Robert Lucky. I haven't read it yet (though I have a PDF if you want it), but here's what he said about it:

It's a wonderful book not just because it's about the fundamental limits of HCI explored/bounded using information theory. It will help you develop an understanding of the Hard Cyborgism approach in a more rigorous way. He also asks directly if AI can help. In 1989. The dude's career was in compression codecs, so it's a rare Hutterpilled GOFAI book. He states the Hutter thesis before Hutter even.

(Note: I have not read this entire post yet.)
