Jacob Pfau

NYU PhD student working on AI safety



Metaculus puts the probability of a singleton at 45%, in the sense of:

This question resolves as Yes if, within five years of the first transformative AI being deployed, more than 50% of world economic output can be attributed to the single most powerful AI system. The question resolves as No otherwise... [definition:] TAI must bring the growth rate to 20%-30% per year.

This agrees with your claim that ruling out a multipolar scenario is unjustifiable given current evidence.

Most Polymarket markets resolve neatly; I'd also estimate that <5% are contentious.

For myself, and I'd guess many LW users, the AI-related questions on Manifold and Metaculus are of particular interest though, and these are a lot worse. My guesses as to the state of affairs there:

  • 33% of AI-related questions on Metaculus having significant ambiguity (shifting my credence by >10%).
  • 66% of AI-related questions on Manifold having significant ambiguity

For example, most AI benchmarking questions do not specify whether or not they allow things like N-trajectory majority vote or web search. And, most of the ambiguities I'm thinking of are worse than this.
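To illustrate why the majority-vote ambiguity matters, here is a minimal sketch under a simplifying assumption of my own: each trajectory is independently correct with probability `p`, and the vote counts as correct only when a strict majority of trajectories is correct. The function name and the example value of `p` are hypothetical, not taken from any particular benchmark.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct.

    Assumes each sample is correct with probability p, independently.
    This is a simplification: real trajectories are correlated, and with
    many possible answers a plurality can win without a strict majority.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    threshold = n // 2 + 1  # smallest count that is a strict majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(threshold, n + 1)
    )


if __name__ == "__main__":
    # Hypothetical single-trajectory accuracy of 60%.
    for n in (1, 5, 25):
        print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions the same model scores noticeably higher with a larger vote, so a question asking whether a model reaches a given benchmark threshold can resolve differently depending on whether voting is allowed.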

On AI, I expect bringing down the ambiguity rate by a factor of 2 would be quite easy, but getting to 5% sounds hard. I wrote up my suggestions for Manifold here a few days ago. For Metaculus, I think they'd benefit from having a dedicated AI-benchmarking mod who is familiar with common ambiguities in that area (they might already have one, but they should be assigned by default).

Prediction markets on similar questions suggest to me that this is a consensus view.

With research automation in mind, here's my wager: the modal top-15 STEM PhD student will redirect at least half of their discussion/questions from peers to mid-2026 LLMs. Here I define the relevant set of questions as those drawn from the same difficulty/diversity/open-endedness distribution that PhD students would have posed in early 2024.

What I want to see from Manifold Markets

I've made a lot of Manifold markets, and find it a useful way to track my accuracy and sanity-check my beliefs against the community. I'm frequently frustrated by how little detail many question writers give on their questions. Most question writers are also too inactive or lazy to address concerns around resolution brought up in comments.

Here's what I suggest: Manifold should create a community-curated feed for well-defined questions. I can think of two ways of implementing this:

  1. (Question-based) Allow community members to vote on whether they think the question is well-defined
  2. (User-based) Track comments on question clarifications (e.g. Metaculus has an option for specifying your comment pertains to resolution), and give users a badge if there are no open 'issues' on their questions.

Currently 2 out of 3 of my top invested questions hinge heavily on under-specified resolution details. The other one was elaborated on after I asked in comments. Those questions have ~500 users active on them collectively.

Given a SotA large model, companies want the profit-optimal distilled version to sell--this will generically not be the original size. On this framing, regulation passes the misuse deployment risk from higher performance (/higher cost) models to the company. If profit incentives and/or government regulation here continue to push businesses to primarily (ideally only?) sell 2-3+ OOM smaller-than-SotA models, I see a few possible takeaways:

  • Applied alignment research inspired by speed priors seems useful: e.g. how do sleeper agents interact with distillation etc.
  • Understanding and mitigating risks of multi-LM-agent and scaffolded LM agents seems higher priority
  • Pre-deployment, within-lab risks contribute more to overall risk

On trend forecasting, I recently created this Manifold market to estimate the year-on-year drop in price for SotA SWE agents. I still want ideas for better and longer-term markets, though!

To be clear, I do not know how well training against arbitrary, non-safety-trained model continuations (instead of "Sure, here..." completions) via GCG generalizes; all that I'm claiming is that doing this sort of training is a natural and easy patch to any sort of robustness-against-token-forcing method. I would be interested to hear if doing so makes things better or worse!

I'm not currently working on adversarial attacks, but would be happy to share the old code I have (probably not useful given you have apparently already implemented your own GCG variant) and have a chat in case you think it's useful. I suspect we have different threat models in mind. E.g. if circuit breakered models require 4x the runs-per-success of GCG on manually-chosen-per-sample targets (to only inconsistently jailbreak), then I consider this a very strong result for circuit breakers w.r.t. the GCG threat.

It's true that this one sample shows something, since we're interested in worst-case performance in some sense. But I'm interested in the increase in attacker burden induced by a robustness method; that's hard to tell from this, and I would phrase the takeaway differently from the post authors. It's also easy to get false-positive jailbreaks IME, where you think you jailbroke the model but your method fails on things which require detailed knowledge like synthesizing fentanyl etc. I think getting clear takeaways here takes more effort (perhaps more than it's worth, so glad the authors put this out).
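As a rough sketch of what I mean by attacker burden: compare runs-per-success for the same attack against a baseline model and a defended one. The function names and all counts below are hypothetical placeholders, not results from any experiment.

```python
def runs_per_success(total_runs: int, successes: int) -> float:
    """Average number of attack runs needed per successful jailbreak."""
    if total_runs < 0 or successes < 0 or successes > total_runs:
        raise ValueError("need 0 <= successes <= total_runs")
    if successes == 0:
        return float("inf")  # no success observed within the budget
    return total_runs / successes


def attacker_burden_factor(
    baseline_runs: int,
    baseline_successes: int,
    defended_runs: int,
    defended_successes: int,
) -> float:
    """Ratio of runs-per-success on the defended model vs. the baseline."""
    if baseline_successes == 0:
        raise ValueError("baseline attack never succeeded; ratio is undefined")
    return runs_per_success(defended_runs, defended_successes) / runs_per_success(
        baseline_runs, baseline_successes
    )


if __name__ == "__main__":
    # Hypothetical counts: 50/100 successes on baseline, 10/100 on defended.
    print(attacker_burden_factor(100, 50, 100, 10))
```

Estimating this ratio needs many samples per model, and each "success" should be checked against a strict criterion to avoid the false-positive jailbreaks mentioned above; a single tuned sample cannot pin it down.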

It's surprising to me that a model as heavily over-trained as LLAMA-3-8b can still be 4-bit quantized without a noticeable quality drop. Intuitively (and I thought I saw this somewhere in a paper or tweet) I'd have expected over-training to significantly increase quantization sensitivity. Thanks for doing this!

I find the circuit-forcing results quite surprising; I wouldn't have expected such big gaps from just changing the target token.

While I appreciate this quick review of circuit breakers, I don't think we can take away much from this particular experiment. They effectively tuned hyper-parameters (choice of target) on one sample, evaluated on only that sample, and called it a "moderate vulnerability". What's more, their working attempt requires a second model (or human) to write a plausible non-decline prefix, which is a natural and easy thing to train against--I've tried this myself in the past.

It's surprising to me that the 'given' setting fails so consistently across models when Anthropic models were found to do well at using gender pronouns equally (50%); cf. my discussion here.

I suppose this means the capability demonstrated in that post was much more training-data-specific and less generalizable than I had imagined.
