Thomas Kwa

Just left Vivek Hebbar's team at MIRI, now doing various empirical alignment projects.

I'm looking for projects in interpretability, activation engineering, and control/oversight; DM me if you're interested in working with me.


Catastrophic Regressional Goodhart



Disagree. To correct the market, the yield of these bonds would have to go way up, which means the price needs to go way down, which means current TIPS holders need to sell, and/or people need to short.

Since TIPS are basically the safest asset, market participants who don't want volatility have few other options to balance riskier assets like stocks. So your pension fund would be crazy to sell TIPS, especially after the yield goes up.

And for speculators, there's no efficient way to short treasuries. If you're betting on 10 year AI timelines, why short treasuries and 2x your money when you could invest in AI stocks and get much larger returns?
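To make the yield-price step in the comment above concrete, here is a minimal sketch of how a bond's price falls when its yield rises. It assumes a simplified zero-coupon bond rather than a real TIPS (which pays coupons and adjusts principal for inflation), and the 2% and 5% yields are purely illustrative numbers, not figures from the discussion. The function name `zero_coupon_price` is hypothetical.

```python
def zero_coupon_price(face_value: float, annual_yield: float, years: float) -> float:
    """Present value of a single payment of face_value received after `years`.

    Simplifying assumption: one payment at maturity, annual compounding.
    Real TIPS pay coupons and index principal to inflation.
    """
    return face_value / (1.0 + annual_yield) ** years


# Illustrative (assumed) yields for a 10-year bond with face value 100.
price_at_low_yield = zero_coupon_price(100.0, 0.02, 10)
price_at_high_yield = zero_coupon_price(100.0, 0.05, 10)

# Fractional loss for someone who bought at the low yield and holds
# while the market reprices to the high yield.
fractional_drop = 1.0 - price_at_high_yield / price_at_low_yield

print(f"price at 2% yield: {price_at_low_yield:.2f}")
print(f"price at 5% yield: {price_at_high_yield:.2f}")
print(f"mark-to-market drop: {fractional_drop:.1%}")
```

Under these assumptions, a large rise in yield requires a large fall in price, which is why the correction would need existing holders to sell or speculators to short.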

Upvoted just for being an explanation of what updatelessness is and why it is sometimes good.

The main use of % loss recovered isn't to directly tell us when a misaligned superintelligence will kill you. In interpretability we hope to use explanations to understand the internals of a model, so the circuit we find will have a "can I take over the world" node. In MAD we do not aim to understand the internals; the whole point of MAD is to detect when the model exhibits new behavior that our explanations do not account for, and to flag it as potentially dangerous.

If you just want the bottom-line number (emphasis mine):

When asked how likely it is that COVID-19 originated from natural zoonosis, experts gave an average likelihood of 77% (median=90%).

Do you have a guess for how much stronger the strongest permanent magnets would be if we had nanotech capable of creating any crystal structure?

The first four points you raised seem to rely on prestige or social proof.

I'm trying to avoid applying my own potentially biased judgement, and it seems pretty necessary to use either my own judgement or some kind of social proof. I admit this has flaws.

But I also think that the prestige of programs like MATS makes the talent quality extremely high (though I may believe Linda on why this is okay), and that Forrest Landry's writing is probably crankery and if alignment is impossible it's likely for a totally different reason.

We also do not focus on getting participants to submit papers to highly selective journals or ML conferences (though not necessarily highly selective for quality of research with regards to preventing AI-induced extinction).

I think we just have different attitudes to this. I will note that ML conferences have other benefits, like networking, talking to experienced researchers, and getting a sense for the field (for me, going to ICML and NeurIPS was very helpful). For domains people already care about, peer review is also a basic check that work is "real": novel, well-communicated, and meeting some minimum quality bar. Interpretability is becoming one of those domains.

It is relevant to consider the quality of research thinking coming out of the camp. If you or someone else had the time to look through some of those posts, I’m curious to get your sense.

I unfortunately don't have the time or expertise to do this, because these posts are in many different areas. One I do understand is this post because it cites mine and I know Jeremy Gillen. The quality and volume of work seem a bit lower than my post, which took 9 person-weeks and is (according to me) not quite good enough to publish or further pursue, though it may develop into a workshop paper. The soft optimization post took 24 person-weeks (assuming 4 people half-time for 12 weeks) plus some of Jeremy's time. I had no training in probability theory or statistics, although I was somewhat lucky in finding a topic that did not require it.

If you clicked through Paul’s somewhat hyperbolic comment of “the entire scientific community would probably consider this writing to be crankery” and consider my response, what are your thoughts on whether that response is reasonable or not? I.e., consider whether the response is relevant, soundly premised, and consistently reasoned.

I have no idea because I don't understand it. It reads vaguely like a summary of crankery. Possibly I would need to read Forrest Landry's work, but given that it's also difficult to read and I currently give 90%+ that it's crankery, you must understand why I don't.

Thanks, this is pretty reassuring. Mostly due to the nonpersonal details about how AISC fits in the pipeline, but also because your work at Apollo is proprietary and so it's not a bad sign it wasn't published.

I feel positively about this finally being published, but want to point out one weakness in the argument, which I also sent to Jeremy.

I don't think the goals of capable agents are well-described by combinations of pure "consequentialist" goals and fixed "deontological" constraints. For example, the AI's goals and constraints could have pointers to concepts that it refines over time, including from human feedback or other sources of feedback. This is similar to corrigible alignment in RLO but the pointer need not directly point at "human values". I think this fact has important safety implications, because goal objects robust to capabilities not present early in training are possible, and we could steer agents towards them using some future descendant of RepE.

I think this argument is not sufficient. Turnout effects of weather can flip elections that are already close, and from our limited perspective, more than 0.1% of elections are close. But the question is asking about the 2028 election in particular, which will probably not be so close.

If SAE features are the correct units of analysis (or at least more so than neurons), should we expect that patching in the feature basis is less susceptible to the interpretability illusion than in the neuron basis?
