jacquesthibs

I work primarily on AI Alignment. Scroll down to my pinned Shortform for an idea of my current work and who I'd like to collaborate with.

Website: https://jacquesthibodeau.com

Twitter: https://twitter.com/JacquesThibs

GitHub: https://github.com/JayThibs 

LinkedIn: https://www.linkedin.com/in/jacques-thibodeau/ 

Sequences

On Becoming a Great Alignment Researcher (Efficiently)

jacquesthibs's Shortform (2 karma · 3y · 356 comments)

Comments (sorted by newest)
jacquesthibs's Shortform
jacquesthibs · 2y

I shared the following as a bio for EAG Bay Area 2024. I'm sharing it here in case it reaches someone who wants to chat or collaborate.

Hey! I'm Jacques. I'm an independent technical alignment researcher with a background in physics and experience in government (social innovation, strategic foresight, mental health and energy regulation). Link to Swapcard profile. Twitter/X.

CURRENT WORK

  • Collaborating with Quintin Pope on our Supervising AIs Improving AIs agenda (making automated AI science safe and controllable). The current project involves a new method allowing unsupervised model behaviour evaluations. Our agenda.
  • I'm a research lead in the AI Safety Camp for a project on stable reflectivity (testing models for metacognitive capabilities that impact future training/alignment).
  • Accelerating Alignment: augmenting alignment researchers using AI systems. A relevant talk I gave. Relevant survey post.
  • Other research that currently interests me: multi-polar AI worlds (and how that impacts post-deployment model behaviour), understanding-based interpretability, improving evals, designing safer training setups, interpretable architectures, and limits of current approaches (what would a new paradigm that addresses these limitations look like?).
  • Used to focus more on model editing, rethinking interpretability, causal scrubbing, etc.

TOPICS TO CHAT ABOUT

  • How do you expect AGI/ASI to actually develop (so we can align our research accordingly)? Will scale plateau? I'd like to get feedback on some of my thoughts on this.
  • How can we connect the dots between different approaches? For example, can Influence Functions, Evaluations, Probes (detecting a truthful direction), Function/Task Vectors, and Representation Engineering work together to give us a better picture than the sum of their parts? (A toy probe sketch follows this list.)
  • Debate over which agenda actually contributes to solving the core AI x-risk problems.
  • What if the pendulum swings in the other direction, and we never get the benefits of safe AGI? Is open source really as bad as people make it out to be?
  • How can we make something like the d/acc vision (by Vitalik Buterin) happen?
  • How can we design a system that leverages AI to speed up progress on alignment? What would you value the most?
  • What kinds of orgs are missing in the space?
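
To make the probes piece of the above concrete, here is a minimal sketch (my own toy example, not any particular paper's method) of fitting a linear probe for a "truthful direction" on a small open model's activations. The model choice (gpt2), the layer index, and the tiny statement set are all illustrative assumptions:

```python
# Toy linear probe for a "truthful direction" (illustrative assumptions: gpt2, layer 6, six statements).
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

statements = [
    ("The capital of France is Paris.", 1),
    ("The capital of France is Rome.", 0),
    ("Water freezes at 0 degrees Celsius.", 1),
    ("Water freezes at 100 degrees Celsius.", 0),
    ("Two plus two equals four.", 1),
    ("Two plus two equals five.", 0),
]

LAYER = 6  # arbitrary middle layer; worth sweeping in a real experiment

def last_token_activation(text: str) -> np.ndarray:
    """Hidden state of the final token at the chosen layer."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1].numpy()

X = np.stack([last_token_activation(text) for text, _ in statements])
y = np.array([label for _, label in statements])

# The probe's weight vector is a candidate "truthful direction" in activation space.
probe = LogisticRegression(max_iter=1000).fit(X, y)
truth_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print("train accuracy:", probe.score(X, y), "| direction shape:", truth_direction.shape)
```

In a real experiment you would sweep layers, use a much larger held-out statement set, and compare the recovered direction against what the other methods (e.g. task vectors or representation engineering) find.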

POTENTIAL COLLABORATIONS

  • Examples of projects I'd be interested in: extending either the Weak-to-Strong Generalization paper (a toy sketch of that setup follows this list) or the Sleeper Agents paper, understanding the impacts of synthetic data on LLM training, working on ELK-like research for LLMs, experiments on influence functions (studying the base model and its SFT, RLHF, and iteratively trained counterparts; I heard that Anthropic is releasing code for this "soon"), or studying the interpolation/extrapolation distinction in LLMs.
  • I’m also interested in talking to grantmakers for feedback on some projects I’d like to get funding for.
  • I'm slowly working on a guide for practical research productivity for alignment researchers to tackle low-hanging fruits that can quickly improve productivity in the field. I'd like feedback from people with solid track records and productivity coaches.
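
Since I mention extending the Weak-to-Strong Generalization paper, here is a minimal sketch of that experimental setup (my own toy illustration with sklearn classifiers standing in for small and large LMs, not the paper's code): a weak supervisor labels data, a strong student is trained on those weak labels, and you measure how much of the weak-to-strong performance gap is recovered (PGR).

```python
# Toy weak-to-strong generalization setup (a sketch under the assumptions above).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=40, n_informative=20, random_state=0)
X_sup, X_rest, y_sup, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_student, X_test, y_student, y_test = train_test_split(X_rest, y_rest, test_size=0.4, random_state=0)

# "Weak supervisor": a deliberately underpowered model that only sees 5 features.
weak = LogisticRegression(max_iter=1000).fit(X_sup[:, :5], y_sup)
weak_acc = accuracy_score(y_test, weak.predict(X_test[:, :5]))

# "Strong ceiling": the strong model trained directly on ground-truth labels.
ceiling = GradientBoostingClassifier(random_state=0).fit(X_student, y_student)
ceiling_acc = accuracy_score(y_test, ceiling.predict(X_test))

# Weak-to-strong: the strong model trained on the weak supervisor's labels.
weak_labels = weak.predict(X_student[:, :5])
w2s = GradientBoostingClassifier(random_state=0).fit(X_student, weak_labels)
w2s_acc = accuracy_score(y_test, w2s.predict(X_test))

pgr = (w2s_acc - weak_acc) / (ceiling_acc - weak_acc)  # performance gap recovered
print(f"weak={weak_acc:.3f}  weak-to-strong={w2s_acc:.3f}  ceiling={ceiling_acc:.3f}  PGR={pgr:.2f}")
```

The interesting extensions are in how the strong student is trained and in how robustly PGR holds up as the capability gap between supervisor and student grows.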

TYPES OF PEOPLE I'D LIKE TO COLLABORATE WITH

  • Strong math background, can understand Influence Functions enough to extend the work.
  • Strong machine learning engineering background. Can run ML experiments and fine-tuning runs with ease. Can effectively create data pipelines.
  • Strong application development background. I have various project ideas that could speed up alignment researchers; I'd be able to execute them much faster if I had someone to help me build my ideas fast. 
jacquesthibs's Shortform
jacquesthibs · 1d

If all the labs intend to cause recursive self-improvement and claim they will solve alignment with some vague “eh, we’ll solve it with automated AI alignment researchers”, that is not good enough.

At the very least, they all need to provide public details of their plan with a Responsible Automation Policy.

jacquesthibs's Shortform
jacquesthibs · 6d

Post is still up here.

More recent thoughts here.

The Company Man
jacquesthibs · 7d

My girlfriend (who is not at all SF-brained and typically doesn’t read LessWrong unless I send her something) really enjoyed it, mainly because it helped her empathize with people in AI safety / LessWrong (it makes them feel more human). She found it well written and enjoyable to read, something she could get through without it feeling like a task.

jacquesthibs's Shortform
jacquesthibs · 9d

That said, I am a little confused by folks who say that “current AI models have nothing to do with future powerful (real) AIs”, yet also consistently point to “bad” behaviour from current AIs as a reason to stop.

Often, the argument made is, “we don’t even understand the previous generations of AIs, how do we even hope to align future AIs?”

I guess the way I understand it is: given that we can’t even get current AIs to do exactly what we want, we should expect the same for future AIs. However, that failure seems at least partly due to current AIs being sloppy and lacking capability, not only to something like “we don’t know how to align current models perfectly to our intentions.”

jacquesthibs's Shortform
jacquesthibs · 9d

The key argument against the superalignment/automated alignment agenda is that while AIs will excel in verifiable domains, such as code, they will struggle with hard-to-verify tasks.

For example, they will struggle with science in domains where we have little data (such as the alignment of superintelligence), and techniques that work for weaker models will be poor proxies that break at superintelligence (e.g. internal reasoning that is harder to monitor, models that are no longer stateless and are continually learning, reasoning that is tangibly different from the weak reasoning that exists today, etc.).

Ultimately, you get convincing slop, and even though you might catch non-superintelligent AIs doing so-called “scheming”, it’s not that helpful because they are not capable enough to cause a catastrophe at this point.

The crux is whether AIs end up capable of 10x-ing actually useful superalignment research while you are in the valley of life: the window when you can quickly verify that outputs are not slop (no longer severely bottlenecked on human talent; after the slop era), but before all your control techniques are basically doomed.

So, you are hoping both to prevent AIs from sabotaging AI safety research AND that the resulting safety research isn’t just a poor proxy that works well at a specific model size/shape but completely fails once you have self-modifying superintelligence.

Ultimately, you’d better have a backup plan for superalignment that isn’t just, “we’ll stop if we catch the AIs being deceptively aligned.” There are worlds where everything seems plausibly safe, you have a very convincing, vetted safety plan, you implement it, and you die.

Why I don't believe Superalignment will work
jacquesthibs · 9d

Thanks for the post, Simon! I think we need more discussion that lays out specific criticisms of, and demands regarding, the labs' mainline alignment plan.

I’d like to eventually put forth my strongest arguments for superalignment as a whole and what would need to happen to realistically convince/force the labs to stop.

Quick comments:

  • “AIs are unlikely to speed up alignment before capabilities”: I think this can also be used as an argument for accelerating automated alignment ASAP if you believe that we won’t otherwise get alignment value out of AIs soon enough (well, some people already are doing this). Unless the crux is “no matter how hard you try to differentially accelerate alignment work, you won’t make any dent in progress”, in which case I disagree, but I think that outcome is more likely in a world where people are too concerned about dual-use to attempt it.
  • Dual-use concerns actually seem overblown, given that the labs are going full steam ahead on automating AI R&D; automated alignment will just lag behind.
  • I think that, as part of automated alignment, we should definitely try to make verification and the detection of hard-to-spot mistakes faster for researchers. There might be probes that help with this (like the hallucination probes help with identifying hallucinations). My guess is that verification with humans in the loop can be made much faster.
  • Automated alignment can be leveraged as part of the scary demo strategy and can make a more convincing case for a pause.
  • I think our main crux is that I expect AIs to accelerate prosaic alignment research, though agent foundations will be much rougher. Luckily, the first thing you want to do as AIs get better at control is probably to make control even more useful and lengthen the window in which we can leverage AIs. In addition, I think there is a belief that traditional alignment theory basically won’t matter and is just another step removed from the problem.
  • I think that given this crux and the fact that the companies will indeed automate AI R&D (and alignment), we should force companies to stop at some point in this process. There should be an even stronger push for some If-Then commitment or RSP style thing. One thing I’m ruminating on is to propose a Responsible Automation Policy that would slow down progress between training and internal deployment.
  • Maybe this also means companies are forced to accelerate a guaranteed-safe-AI-style plan beyond a certain level of capability.
Training AI to do alignment research we don’t already know how to do
jacquesthibs · 12d

I've DM'd you my current draft doc on this, though it may be incomprehensible.

Have you published this doc? If so, which one is it? If not, may I see it?

Shortform
jacquesthibs · 13d

Hmm, so I still hold the view that they are worthwhile even if they are not informative, particularly for the reasons you seem to have pointed to (i.e. training up good human researchers to identify who has a knack for a specific style of research, so that we can use them to provide initial directions to AIs automating AI safety R&D and to serve as model-output verifiers, OR building infra that ends up being used by AIs that are good enough to run tons of experiments leveraging that infra but not good enough to come up with completely new paradigms).

Shortform
jacquesthibs · 14d

For those who haven't seen, coming from the same place as OP, I describe my thoughts in Automating AI Safety: What we can do today.

Specifically in the side notes:

Should we just wait for research systems/models to get better?

[...] Moreover, once end-to-end automation is possible, it will still take time to integrate those capabilities into real projects, so we should be building the necessary infrastructure and experience now. As Ryan Greenblatt has said, “Further, it seems likely we’ll run into integration delays and difficulties speeding up security and safety work in particular[…]. Quite optimistically, we might have a year with 3× AIs and a year with 10× AIs and we might lose half the benefit due to integration delays, safety taxes, and difficulties accelerating safety work. This would yield 6 additional effective years[…].” Building automated AI safety R&D ecosystems early ensures we're ready when more capable systems arrive.

Research automation timelines should inform research plans

It’s worth reflecting on scheduling AI safety research based on when we expect sub-areas of safety research will be automatable. For example, it may be worth putting off R&D-heavy projects until we can get AI agents to automate our detailed plans for such projects. If you predict that it will take you 6 months to 1 year to do an R&D-heavy project, you might get more research mileage by writing a project proposal for this project and then focusing on other directions that are tractable now. Oftentimes it’s probably better to complete 10 small projects in 6 months and then one big project in an additional 2 months, rather than completing one big project in 7 months.

This isn’t to say that R&D-heavy projects are not worth pursuing—big projects that are harder to automate may still be worth prioritizing if you expect them to substantially advance downstream projects (such as ControlArena from UK AISI). But research automation will rapidly transform what is ‘low-hanging fruit’. Research directions that are currently impossible due to the time or necessary R&D required may quickly go from intractable to feasible to trivial. Carefully adapting your code, your workflow, and your research plans for research automation is something you can—and likely should—do now.
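
To make the arithmetic in the quoted Ryan Greenblatt estimate concrete, here is the back-of-the-envelope version (my own restatement of the numbers in the excerpt above, not his code):

```python
# One calendar year with 3x-speedup AIs and one with 10x-speedup AIs,
# keeping roughly half the benefit after integration delays and safety taxes.
speedups = [3, 10]                          # assumed research-output multipliers, one per calendar year
additional = sum(s - 1 for s in speedups)   # effective years gained beyond the 2 calendar years = 11
realized = 0.5 * additional                 # lose ~half to integration delays, safety taxes, etc.
print(f"additional effective years ≈ {realized:.1f}")  # ≈ 5.5, i.e. roughly the quoted 6
```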


I'm also very interested in having more discussions on what a defence-in-depth approach would look like for early automated safety R&D, so that we can get value from it for longer and point the system towards the specific kinds of projects that will lead to techniques that scale to the next scale-up / capability increase.

Posts (sorted by new)

  • Automating AI Safety: What we can do today (36 karma · 2mo · 0 comments)
  • What Makes an AI Startup "Net Positive" for Safety? (82 karma · 5mo · 23 comments)
  • How much I'm paying for AI productivity software (and the future of AI use) (59 karma · 1y · 18 comments)
  • Shane Legg's necessary properties for every AGI Safety plan [Question] (58 karma · 1y · 12 comments)
  • AISC Project: Benchmarks for Stable Reflectivity (17 karma · 2y · 0 comments)
  • Research agenda: Supervising AIs improving AIs [Alignment Forum] (76 karma · 2y · 5 comments)
  • Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky (293 karma · 3y · 297 comments)
  • Practical Pitfalls of Causal Scrubbing [Alignment Forum] (87 karma · 3y · 17 comments)
  • Can independent researchers get a sponsored visa for the US or UK? [Question] (23 karma · 3y · 1 comment)
  • What's in your list of unsolved problems in AI alignment? [Question] (60 karma · 3y · 9 comments)