Jacob Pfau

UK AISI Alignment Team and NYU PhD student

Comments (sorted by newest)
What's up with Anthropic predicting AGI by early 2027?
Jacob Pfau · 13d

One way of checking your 6% is by doing Laplace succession on the METR trend:

Applying Laplace's rule gives an 11% probability of any dramatic upward trend break above METR's curve. Your interpretation of Anthropic's prediction for 2027 is compatible with this, since the 11% is an upper bound, and 6% of that mass being at least as high as your Anthropic prediction seems plausible.

In detail, here's the probability of a trend break by early 2027 via Laplace's rule of succession: METR has observed 6 years of an exponential fit holding, which gives us a per-year point-estimate parameter p = 7/8 that the trend will continue to hold. That implies roughly a 22% chance of going off trend over the next two years, and if half of that mass is via upward deviation we get 11%.

I'm sure there's fancier statistics to be done here, but I'd guess anything reasonable gets us an order of magnitude around 11%.
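The arithmetic above can be written out in a few lines. A minimal sketch of the sequential rule-of-succession update, assuming only the 6 observed on-trend years stated above:

```python
def laplace_next(successes: int, trials: int) -> float:
    """Laplace's rule of succession: P(next success) = (s + 1) / (n + 2)."""
    return (successes + 1) / (trials + 2)

years_on_trend = 6

# P(trend holds each of the next two years), updating sequentially
# as each new on-trend year is observed.
p_year1 = laplace_next(years_on_trend, years_on_trend)          # 7/8
p_year2 = laplace_next(years_on_trend + 1, years_on_trend + 1)  # 8/9

p_hold = p_year1 * p_year2    # 7/9, roughly 0.78
p_break = 1 - p_hold          # roughly 0.22
p_upward_break = p_break / 2  # roughly 0.11, if half the off-trend mass is upward

print(f"P(trend break by early 2027) = {p_break:.0%}")
print(f"P(upward trend break)        = {p_upward_break:.0%}")
```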

Jacob Pfau's Shortform
Jacob Pfau · 23d

Highly recommend reading Ernest Ryu's Twitter multi-thread on proving a long-standing, well-known conjecture (the convergence of Nesterov gradient descent on convex functions) with heavy use of ChatGPT Pro. Ernest even includes the ChatGPT logs! Part 1, 2, 3.

Ernest gives useful commentary on where/how he found it best to interact with GPT. Incidentally, there's a nice human baseline as well, since another group of researchers coincidentally wrote up a similar result privately this month!

To add some of my own spin: it seems to me time horizons are a nice lens for viewing the collaboration. Ernest clearly has a long-horizon view of this research problem that helped him (a) know which nearby problem was most tractable to start on, (b) identify when diminishing returns (the likelihood of a dead end) became apparent, and (c) pull out useful ideas from usually flawed GPT work.

The one-week scale of interaction between Ernest and ChatGPT here is a great example of how we're very much in a centaur regime now. We really need to be conceptualizing and measuring AI+human capabilities rather than single-AI capability. It also seems important to be thinking about what safety concerns arise in this regime.

Noah Birnbaum's Shortform
Jacob Pfau · 25d

This was also on my mind after seeing Jesse's shortform yesterday. Ryan's "this is good" comment was above Louis' thorough explanation of an alternative formal motivation for IFs. That would still be the case if I hadn't strong-upvoted and weak-downvoted.

I personally cast my comment up/downvotes as an expression of my preference ordering for visibility, and I would encourage others to do the same. For instance, I suggest Ryan's comment should've been agreement-voted rather than upvoted by others. A corollary of this stance is to not vote if you haven't read the other comments whose rankings you are affecting, or rather, to vote with any of the other markers of which LW has many.

This 'upvotes as visibility preferences' policy isn't tractable for posts, so I suspect the solution there (if one is needed) would have to be done on the backend by normalization. Not sure whether this is worth attempting.

Link here since I don't particularly want to call out Ryan; his comment was fine. https://www.lesswrong.com/posts/7X9BatdaevHEaHres/jesse-hoogland-s-shortform

Jacob Pfau's Shortform
Jacob Pfau · 1mo

Thought-provoking, thanks! First off, whether or not humanity (or current humans, or some distribution over humans) has already achieved escape velocity does not directly undercut the utility of escape velocity as a concept for AI takeoff. Certainly, I'd say the collective of humans has achieved some leap to universality in a sense that the bag of current near-SotA LLMs has not! And in particular, it's perfectly reasonable to take humans as our reference to define a unit dh/dt point for AI improvement.

Ok, now on to the hard part (speculating somewhat beyond the scope of my original post). Is there a nice notion of time horizon that generalizes METR's and lets us say something about when humanity has achieved escape velocity? I can think of two versions.

The easy way out is to punt to some stronger reference class of beings to get our time horizons, and measure dh/dt for humanity against this baseline. Now the human team is implicitly time-limited by the stronger beings' bound, and we count false positives against the humans even if humanity could eventually self-correct.

Another idea is to notice that there's some class of intractable problems on which current human progress looks like either (1) random search or (2) entirely indirect, instrumental progress--e.g. self-improvement, general tool-building, etc. In these cases, there may be a sense in which we're exponentially slower than task-capable agent(s), and we should be considered incapable of completing such tasks. I imagine some Millennium Prize problems, astronomical engineering, etc. would reasonably be considered beyond us on this view.

Overall, I'm not particularly happy with these generalizations. But I still like having 'escape velocity for AI auto-R&D' as a threshold!

Fabien's Shortform
Jacob Pfau · 1mo

My perception is that Trump 2 is on track to be far worse (e.g. in terms of eroding democratic practices and institutions) than Trump 1. My vague understanding is that a main driver of this difference is how many people gave up on the "working with bad guys to make them less bad" plan--though probably this was not directly because they changed their view on such reasoning.

Should this update us on the working for net-negative AGI companies case?

Jacob Pfau's Shortform
Jacob Pfau · 1mo

(Thanks for expanding! Will return to write a proper response to this tomorrow)

Jacob Pfau's Shortform
Jacob Pfau · 1mo

I've never been compelled by talk about continual learning, but I do like thinking in terms of time horizons. One notion of singularity that we can think of in this context is:

Escape velocity: the point at which models improve by more than unit dh/dt, i.e. gain more than one unit of horizon h per unit of wall-clock time t.

Then, by modeling some ability to regenerate, or continuously deploy, improved models, you can predict this point. Very surprised I haven't seen this mentioned before; has someone written about this? The closest thing that comes to mind is T. Davidson's ASARA SIE.

Of course, the METR methodology is likely to break down well before this point, so it's empirically not very useful. But I like this framing! Conceptually, there will be some point where models can robustly take over R&D work, including training and inventing new infrastructure (skills, MCPs, caching protocols, etc.). If they also know when to revisit their previous work, they can then work productively over arbitrary time horizons. Escape velocity is a nice concept to have in our toolbox for thinking about R&D automation. It's a weaker notion than Davidson's SIE.
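To make the threshold concrete, here's a toy illustration under assumed numbers (a 1-month current horizon h0 and a 7-month doubling time T; both are illustrative assumptions, not METR's estimates): if the horizon grows as h(t) = h0·2^(t/T) with h and t in the same units, escape velocity dh/dt = 1 is reached at t* = T·log2(T / (h0·ln 2)).

```python
import math

# Toy model: horizon h(t) = h0 * 2**(t / T), with h and t both in months.
# Escape velocity is the point where dh/dt > 1, i.e. the model gains more
# than a month of horizon per month of wall-clock time.
# h0 and T are illustrative assumptions, not METR's actual figures.

h0 = 1.0  # current horizon: 1 month (assumed)
T = 7.0   # doubling time: 7 months (assumed)

def dh_dt(t: float) -> float:
    """Derivative of h(t) = h0 * 2**(t / T)."""
    return h0 * (math.log(2) / T) * 2 ** (t / T)

# Solving dh/dt = 1 for t gives t* = T * log2(T / (h0 * ln 2)).
t_star = T * math.log2(T / (h0 * math.log(2)))

print(f"Escape velocity reached after ~{t_star:.1f} months")
```

Under these made-up parameters the threshold lands a couple of years out; the point of the sketch is only that the threshold is well-defined once you commit to a growth curve.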

faul_sname's Shortform
Jacob Pfau · 1mo

It can both be the case that "a world in which the U.S. government took von Neumann's advice would likely be a much darker, bleaker, more violent one" and that JvN was correct ex ante. In particular, I find it plausible that we're living in quite a lucky timeline--one in which the Cuban missile crisis and other coinflips landed in our favor.

Ryan Kidd's Shortform
Jacob Pfau · 3mo

UK AISI is a government agency, so the pie chart is probably misleading on that segment!

kh's Shortform
Jacob Pfau · 3mo

I like the concept; on the other hand, the flag feels strongly fascist to me.

I ran it by the AIs, and 2 out of 3 had "authoritarian" as their first descriptor when responding to "What political alignment does the aesthetic of this flag evoke?" FWIW.

Posts

- Research Areas in Methods for Post-training and Elicitation (The Alignment Project by UK AISI) · 4mo · 12 points · 0 comments
- Research Areas in Benchmark Design and Evaluation (The Alignment Project by UK AISI) · 4mo · 10 points · 0 comments
- Research Areas in Probabilistic Methods (The Alignment Project by UK AISI) · 4mo · 3 points · 0 comments
- Research Areas in Evaluation and Guarantees in Reinforcement Learning (The Alignment Project by UK AISI) · 4mo · 14 points · 0 comments
- The Alignment Project by UK AISI · 4mo · 29 points · 0 comments
- Unexploitable search: blocking malicious use of free parameters · 6mo · 34 points · 16 comments
- An alignment safety case sketch based on debate · 6mo · 57 points · 21 comments
- UK AISI’s Alignment Team: Research Agenda · 6mo · 113 points · 2 comments
- Prospects for Alignment Automation: Interpretability Case Study · 8mo · 32 points · 5 comments
- Auditing LMs with counterfactual search: a tool for control and ELK · 2y · 28 points · 6 comments
Wikitag Contributions

- Situational Awareness · 3y · (+46/-1)
- Situational Awareness · 3y · (+143/-8)
- Situational Awareness · 3y · (-131)
- Situational Awareness · 3y · (+1)
- Situational Awareness · 3y · (-181)
- Situational Awareness · 3y · (+42/-15)
- Situational Awareness · 3y · (+802)
- Threat Models (AI) · 3y · (+33)
- AI Risk Concrete Stories · 3y · (+6/-6)
- AI Risk Concrete Stories · 3y · (+6/-51)