I work primarily on AI Alignment. Scroll down to my pinned Shortform for an idea of my current work and who I'd like to collaborate with.
Website: https://jacquesthibodeau.com
Twitter: https://twitter.com/JacquesThibs
GitHub: https://github.com/JayThibs
LW feature request (low on the importance scale):
It would be nice to be able to refresh the TTS for a post if it has been edited. I was reading this post, and it was a bit confusing to keep track of the audio since it had been edited.
Hmm, my thought was that devs (or at least Anthropic folks) have improved their ability to estimate how much AI is helping us since the release of the first truly agentic model? My feeling is that most top-end people should be better calibrated despite the moving target. Most people in the study had spent less than 50 hours (except for one of the folks who performed well), so I don’t think we cnnuse the study to say much about how things change over the course months or a year of usage and training (unless we do another study I guess).
In terms of the accurate prediction, I’m not recalling what exactly made me believe this, though if you look at the first chart in the METR thread, the confidence intervals of the predicted uplift from the devs is below the 38%. The average thought they were 24% faster at the beginning of the study (so, in fact, he probably underestimated his uplift a bit).
I think there is nuance about the downlift study that would be helpful to highlight:
This is not to say that it’s true that Anthropic employees are getting that high of an uplift, but may make it a bit more believable.
I’ve looked into this as part of my goal of accelerating safety research and automating as much as we can. It was one of the primary things I imagined we would do when we pushed for the non-profit path. We eventually went for-profit because we expected there would not be enough money dispersed to do this, especially in a short timelines world.
I am again considering going non-profit again to pursue this goal, among others. I’ll send you and others a proposal on what I would imagine this looks like in the grander scheme.
I’ve been in AI safety for a while now and feel like I’ve formed a fairly comprehensive view of what would accelerate safety research, reduce power concentration, what it takes to automate research more safely as capabilities increase, and more.
I’ve tried to make this work as part of a for-profit, but it is incredibly hard to tackle the hard parts of the problem in that situation and since that is my intention, I’m again considering if a non-profit will have to do despite the unique difficulties that come with that.
Most AI safety plans include “automating AI safety research.” There’s a need for better clarity of what it looks like.
There are at least four things that get conflated in the term “automated research”:
For AI safety, the crux of many disagreements is whether one believes that:
Ultimately, this seems like a highly important question to clarify, since I believe it is driving many people to be optimistic about AI safety progress, at least to the point that it allows them to keep chugging along the capabilities tech tree. Having clarity on what would convince people otherwise much sooner seems important.
Relevant: https://www.lesswrong.com/posts/88xgGLnLo64AgjGco/where-are-the-ai-safety-replications
I think doing replications is great and it’s one of the areas I think automated research would be helpful soon. I replicated the Subliminal Learning paper on the day of the release because it was fairly easy to grab the paper, docs, codebases, etc to replicate quickly.
Short timelines, slow takeoff vs. Long timelines, fast takeoff
Due to chain-of-thought in the current paradigm seeming like great news for AI safety, some people seem to have the following expectations:
Short timelines: CoT reduces risks, but shorter preparation time increases the odds of catastrophe.
Long timelines: the current paradigm is not enough; therefore, CoT may stop being relevant, which may increase the odds of catastrophe. We have more time to prepare (which is good), but we may get a faster takeoff than the current paradigm makes it seem like. And therefore, discontinuous takeoff may introduce significantly more risks despite longer timelines.
So, perhaps counterintuitively for some, you could have these two groups:
1. Slow (smooth, non-discontinuous) takeoff, low p(doom), takeoff happens in the next couple of years. [People newer to AI safety seem more likely to expect this imo]
Vs.
2. Fast takeoff (discontinuous capability increase w.r.t. time), high p(doom), (actual) takeoff happens in 8-10 years. [seems more common under the MIRI / traditional AI safety researchers cluster]
I’m not saying those are the only two groups, but I think part of it speaks to how some people are feeling about the current state of progress and safety.
As a result, I think it’s pretty important to gain better clarity on whether we expect the current paradigm to scale without fundamental changes, and, if not, to understand what would come after it and how it would change the risks.
That’s not to say we shouldn’t weigh short timelines more highly due to being more immediate, but there are multiple terms to weigh here.
I agree it's true that other forums would engage with even worse norms, but I'm personally happy to keep the bar high and have a high standard for these discussions, regardless of what others do elsewhere. My hope is that we never stop striving for better, especially since, for alignment, the stakes are incredibly higher than most other domains, so we need a higher standard of frankness.
I generally think it makes sense for people to have pretty complicated reasons for why they think something should be downvoted. I think this goes more for longer content, which often would require an enormous amount of effort to respond to explicitly.
Generally agree with this. I think in this case, I’m trying to call out safety folks to be frank with themselves and avoid the mistake of not trying to figure out if they really believe alignment is still hard or are looking for reasons to believe it is. Might not be what is happening here, but I did want to encourage critical thinking and potentially articulating it in case it is for some.
(Also, I did not mean for people to upvote it to the moon. I find that questionable too.)
I shared the following as a bio for EAG Bay Area 2024. I'm sharing this here if it reaches someone who wants to chat or collaborate.
Hey! I'm Jacques. I'm an independent technical alignment researcher with a background in physics and experience in government (social innovation, strategic foresight, mental health and energy regulation). Link to Swapcard profile. Twitter/X.
CURRENT WORK
TOPICS TO CHAT ABOUT
POTENTIAL COLLABORATIONS
TYPES OF PEOPLE I'D LIKE TO COLLABORATE WITH