CBiddulph


Yep, LLMs are stochastic in the sense that there isn't literally a 100% probability of their outputs having any given property. But they could very well be effectively deterministic (e.g., there's plausibly a >99.999% probability that GPT-4's response to "What's the capital of France?" includes the string "Paris").
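
As a rough illustration of the kind of claim being made, here's a minimal sketch that estimates how often a model's sampled outputs share a given property. It assumes the OpenAI Python SDK and uses an illustrative model name, prompt, and sample count; nothing here is from the original comment, and a few hundred samples could only bound the probability crudely, not distinguish 99.9% from 99.999%.

```python
# Minimal sketch: estimate how often sampled completions contain a target substring.
# Assumes the OpenAI Python SDK (`pip install openai`) with an API key in the
# environment; model name, prompt, and sample count are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def property_frequency(prompt: str, substring: str, n_samples: int = 100) -> float:
    """Return the fraction of sampled completions that contain `substring`."""
    hits = 0
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4",  # illustrative model choice
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,  # ordinary sampling, not greedy decoding
            max_tokens=50,
        )
        if substring in response.choices[0].message.content:
            hits += 1
    return hits / n_samples

if __name__ == "__main__":
    freq = property_frequency("What's the capital of France?", "Paris")
    print(f"Observed frequency: {freq:.2%}")
```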

I think it'd be good to put a visible deadline on the fundraising thermometer, now-ish or maybe just in the final week of the fundraiser. It conveys some urgency, like "oh shoot are we going to make the $X funding goal in time? I better donate some money to make that happen."

Multiple deadlines ($2M by date Y, $3M by date Z) might be even better.

Answer by CBiddulph

Anthropic's computer use model and Google's Deep Research both do this. Training systems like this to work reliably has been a bottleneck to releasing them.

Strong-upvoted just because I didn't know about the countdown trick and it seems useful.


I really like your vision for interpretability!

I've been a little pessimistic about mech interp as compared to chain of thought, since CoT already produces very understandable end-to-end explanations for free (assuming faithfulness etc.).

But I'd be much more excited if we could actually understand circuits to the point of replacing them with annotated code or perhaps generating on-demand natural language explanations in an expandable tree. And as long as we can discover the key techniques for doing that, the fiddly cognitive work that would be required to scale it across a whole neural net may be feasible with AI only as capable as o3.

While I'd be surprised to hear about something like this happening, I wouldn't be that surprised. But in this case, it seems pretty clear that o1 is correcting its own mistakes in a way that past GPTs essentially never did, if you look at the CoT examples in the announcement (e.g. the "Cipher" example).

Yeah, I'm particularly interested in scalable oversight over long-horizon tasks and chain-of-thought faithfulness. I'd probably be pretty open to a wide range of safety-relevant topics though.

In general, what gets me most excited about AI research is trying to come up with the perfect training scheme to incentivize the AI to learn what you want it to - things like HCH, Debate, and the ELK contest were really cool to me. So I'm a bit less interested in areas like mechanistic interpretability or very theoretical math.

Thanks Neel, this comment pushed me over the edge into deciding to apply to PhDs! Offering to write a draft and taking off days of work are both great ideas. I just emailed my prospective letter writers and got 2/3 yeses so far.

I just wrote another top-level comment on this post asking about the best schools to apply to, feel free to reply if you have opinions :)

I decided to apply, and now I'm wondering what the best schools are for AI safety.

After some preliminary research, I'm thinking these are the most likely schools to be worth applying to, in approximate order of priority:

  • UC Berkeley (top choice)
  • CMU
  • Georgia Tech
  • University of Washington
  • University of Toronto
  • Cornell
  • University of Illinois - Urbana-Champaign
  • University of Oxford
  • University of Cambridge
  • Imperial College London
  • UT Austin
  • UC San Diego

I'll probably cut this list down significantly after researching the schools' faculty and their relevance to AI safety, especially for schools lower on this list.

I might also consider the CDTs in the UK mentioned in Stephen McAleese's comment. But I live in the U.S. and am hesitant about moving abroad - maybe this would involve some big logistical tradeoffs even if the school itself is good.

Anything big I missed? (Unfortunately, the Stanford deadline is tomorrow and the MIT deadline was yesterday, so those aren't gonna happen.) Or, any schools that seem obviously worse than continuing to work as a SWE at a Big Tech company in the Bay Area? (I think the fact that I live near Berkeley is a nontrivial advantage for me, career-wise.)
