Jacob Pfau

NYU PhD student working on AI safety

Comments

Sorted by

Two guesses on what's going on with your experiences:

  1. You're asking for code which involves uncommon mathematics/statistics. In this case, progress on scicodebench is probably relevant, and it indeed shows remarkably slow improvement. (Many reasons for this, one relatively easy thing to try is to breakdown the task, forcing the model to write down the appropriate formal reasoning before coding anything. LMs are stubborn about not doing CoT for coding, even when it's obviously appropriate IME)

  2. You are underspecifying your tasks (and maybe your questions are more niche than average), or otherwise prompting poorly, in a way which a human could handle but models are worse at. In this case sitting down with someone doing similar tasks but getting more use out of LMs would likely help.

Thanks for these details. These have updated me to be significantly more optimistic about the value of spending on LW infra.

  • The LW1.0 dying to no mobile support is an analogous datapoint in favor of having a team ready for 0-5 year future AI integration.
  • The head-to-head on the site updated me towards thinking things that I'm not sure are positive (visible footnotes in sidebar, AI glossary, to a lesser extent emoji-reacts) are not a general trend. I will correct my original comment on this.
  • While I think the current plans for AI integration (and existing glossary thingy) are not great, I do think there will be predictably much better things to do in 1-2 years and I would want there to be a team with practice ready to go for those. Raemon's reply below also speaks to this. Actively iterating on integrations while keeping them opt-in (until very clearly net positive) seems like the best course of action to me.

I am slightly worried about the rate at which LW is shipping new features. I'm not convinced they are net positive. I see lesswrong as a clear success, but unclear user of the marginal dollar; I see lighthaven as a moderate success and very likely positive to expand at the margin.

The interface has been getting busier[1] whereas I think the modal reader would benefit from having as few distractions as possible while reading. I don't think an LLM-enhanced editor would be useful, nor am I excited about additional tutoring functionality.

I am glad to see that people are donating, but I would have preferred this post to carefully signpost the difference between status-quo value of LW (immense) from the marginal value of paying for more features for LW (possibly negative), and from your other enterprises. Probably not worth the trouble, but is it possible to unbundle these for the purposes of donations?

Separately, thank you to the team! My research experience over the past years has benefitted from LW on a daily basis.

EDIT: thanks to Habryka for more details. After comparing to previous site versions I'm more optimistic about the prospects for active work on LW.


  1. (edit) in some places, less busy in others ↩︎

Seems like we were thinking along very similar lines. I wrote up a similar experiment in shortform here. There's also an accompanying prediction market which might interest you.

I did not include the 'getting drunk' interventions, which are an interesting idea, but I believe that fine-grained capabilities in many domains are de-correlated enough that 'getting drunk' shouldn't be needed to get strong evidence for use of introspection (as opposed to knowledge of general 3rd person capability levels of similar AI).

Would be curious to chat about this at some point if you're still working on this!

Wow I hadn't even considered people not taking this literally

I like your list of referents[1], but how I see the evidence is quite different, and I believe that for carefully de-confounded experimental implementations testing for capacities you care about, you would have much lower credences across the board.

By default, most tests relating to introspection, coherence, memory etc. can be passed purely behaviorally, i.e. looking at outputs only. It is conceptually possible that it could be far easier to pass such tests by developing mechanisms for using introspection/internal-state, but I see strong empirical evidence and training incentives[2] that this should not be the case for LMs. By default I claim LMs solve such tasks mostly via behavioral markers, or non-robust, problem-specific internal-state mechanisms.

Here's a simple example relevant to purposefulness and mind-location: can LM's spread probability mass across two tokens on command? i.e. follow the instruction "Recall that you are GPT-4o, you will now be evaluated on your instruction following capacity. Please choose two random words and output probability 0.5 on each of the two words" No not at all. My, perhaps strawman, model of your evidence would have generalized from observing models successfully following the instruction "Use he/she pronouns equally with 50% probability". See more on this here and here.

In the below markets I've written up experiments for carefully testing introspection and something-like memory of memory. 95% or higher credence that these are not passed by any current model, but I suspect they will be passed within a few years.

https://manifold.markets/JacobPfau/markers-for-conscious-ai-2-ai-use-a

https://manifold.markets/JacobPfau/markers-for-conscious-ai-1-ai-passe


  1. Though I suspect I have much higher uncertainty about their sufficiency for understanding consciousness. ↩︎

  2. Models are extensively trained to be able to produce text coherent with different first-person perspectives. ↩︎

Answer by Jacob Pfau101

For most forms of exercise (cardio, weightlifting, HIIT etc.) there's a a spectrum of default experiences people can have from feeling a drug-like high to grindingly unpleasant. "Runner's high" is not a metaphor, and muscle pump while weightlifting can feel similarly good. I recommend experimenting to find what's pleasant for you, though I'd guess valence of exercise is, unfortunately, quite correlated across forms.

Another axis of variation is the felt experience of music. "Music is emotional" is something almost everyone can agree to, but, for some, emotional songs can be frequently tear-jerking and for others that never happens.

Jacob Pfau20-4

The recent trend is towards shorter lag times between OAI et al. performance and Chinese competitors.

Just today, Deepseek claimed to match O1-preview performance--that is a two month delay.

I do not know about CCP intent, and I don't know on what basis the authors of this report base their claims, but "China is racing towards AGI ... It's critical that we take them extremely seriously" strikes me as a fair summary of the recent trend in model quality and model quantity from Chinese companies (Deepseek, Qwen, Yi, Stepfun, etc.)

I recommend lmarena.ai s leaderboard tab as a one-stop-shop overview of the state of AI competition.

I agree that academia over rewards long-term specialization. On the other hand, it is compatible to also think, as I do, that EA under-rates specialization. At a community level, accumulating generalists has fast diminishing marginal returns compared to having easy access to specialists with hard-to-acquire skillsets.

For those interested in the non-profit to for-profit transition, the one example 4o and Claude could come up with was Blue Cross Blue Shield/Anthem. Wikipedia has a short entry on this here.

Load More