Verifying solutions is time-consuming enough that I don't think this really alleviates the mentorship bottleneck. And it's quite hard to specify research problems that both capture important things and are precise enough to make a good bounty. So I'm fairly pessimistic about this resolving the issue. I personally would expect more research to happen per unit of my time from mentoring MATS than from putting up and judging bounties
Also, to give a bit more context on my thinking here: I currently think it's fine for us to accept funding from DeepMind safety employees without counting it towards this bucket, largely because my sense is that the social coordination across the pond here is much less intense, and the DeepMind safety team has generally struck me as the most independent to date from the harsh financial incentives here
Seems true to me, though alas we don't seem likely to be getting an Anthropic-level windfall any time soon :'(
Makes sense! Options like Giving What We Can regranting seem to work for a bunch of people, but maybe they aren't willing to do it for Lightcone?
But yeah, any donor giving $50K+ can afford to set up their own donor advised fund and get round this kind of issue, so you probably aren't losing out on THAT much by having this be hard.
What are the specific concerns re having too much funding come from frontier lab employees? I predict that it's much more OK to be mostly funded by a collection of 20+ somewhat disagreeable frontier lab employees with varied takes than to have it all come from OpenPhil. It seems much less likely that the employees will coordinate to pull funding all at once, or pressure you in the same way (especially if they include people from different labs)
I think that number is way too low for anyone who OpenAI actually really cares about hiring. Though this kind of thing is very, very heavy-tailed
If you are in the UK, donate through NPT Transatlantic.
This recommendation will not work for most donors. NPT charge a flat fee of £2,000 on any donation under £102,000. They are mostly a donor-advised fund service; while they do allow single gifts, you need to be giving in the tens of thousands for it to make any sense.
The Anglo-American Charity is a better option for tax-deductible donations to US charities: their fee is 4% on donations below £15K, with a £250 minimum. (I have not personally used them, but friends of mine have.)
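To make the comparison concrete, here's a rough sketch of the fee arithmetic using only the figures quoted above (the fee functions are simplified illustrations of the quoted schedules, not official calculators, and ignore whatever these organisations charge outside the stated ranges):

```python
def npt_fee(donation_gbp: float) -> float:
    """NPT Transatlantic: flat £2,000 fee on any donation under £102,000
    (figure quoted above; larger gifts are presumably priced differently)."""
    return 2_000.0

def aac_fee(donation_gbp: float) -> float:
    """Anglo-American Charity: 4% on donations below £15K, £250 minimum
    (again, just the figures quoted above)."""
    return max(0.04 * donation_gbp, 250.0)

# Compare effective fees for a few gift sizes below £15K
for d in (5_000, 10_000, 15_000):
    print(f"£{d:>6,}: NPT £{npt_fee(d):,.0f} ({npt_fee(d) / d:.0%})  "
          f"vs AAC £{aac_fee(d):,.0f} ({aac_fee(d) / d:.0%})")
```

The point being that for a £10K gift, the flat £2,000 fee is 20% of the donation, versus roughly 4% via the percentage-based option.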
Given the size of the UK rationality community (including people in high-paying tech and finance roles), I imagine there would be interest if you could set up a more convenient way for small-to-medium UK-based donors to donate tax-deductibly
Great work! I'm excited to see red-team/blue-team games being further invested in and scaled up. I think it's a great style of objective proxy task
Indeed, I have long thought that mechanistic interpretability was overinvested relative to other alignment efforts (but underinvested in absolute terms) exactly because it was relatively easy to measure and feel like you were making progress.
I'm surprised that you simultaneously seem concerned that it was too easy to feel like you were making progress in past mech interp, yet push back against us saying that it was too easy to incorrectly feel like you were making progress in mech interp and that we need better metrics of progress
In general they want to time-box and quantify basically everything?
The key part is to be objective, which is related to, but not the same thing as, being quantifiable. For example, you can test whether your hypothesis is correct by making non-trivial empirical predictions and then verifying them: if you change the prompt in a certain way, what will happen? Or can you construct an adversarial example in an interpretable way?
Pragmatic problems are often the comparative advantage of frontier labs.
Our post is aimed at the community in general, not just the community inside frontier labs, so this is not an important part of our argument, though there are definitely certain problems we are comparatively advantaged at studying
I continue to feel like we're talking past each other, so let me start again. We both agree that causing human extinction is extremely bad. If I understand you correctly, you are arguing that it makes sense to follow deontological rules, even when breaking them seems locally beneficial for a really good reason, because on average a decision theory that's willing to do harmful things for complex reasons performs badly.
The goal of my various analogies was to point out that this is not actually a fully correct statement of common-sense morality. Common-sense morality has several exceptions, for things like having someone's consent to take on a risk, someone doing bad things to you, and innocent people being forced to do terrible things.
Given that exceptions exist for times when we believe the general policy is bad, I am arguing that there should be an additional exception: if there is a realistic chance that a bad outcome happens anyway, and you believe you can reduce the probability of that bad outcome (even after accounting for cognitive biases, sources of overconfidence, etc.), it can be ethically permissible to take actions whose side effects increase the probability of the bad outcome in other ways.
When I analyse why I broadly buy the deontological framework for "don't commit murder", I see some clear lines in the sand, such as maintaining a valuable social contract, and the expectation that if you do nothing, the outcomes will be broadly good. Further, society has never really had to deal with something as extreme as doomsday machines, which makes me hesitant to appeal to common-sense morality at all. To me, the point where standard deontological reasoning breaks down is that this is far outside the context in which such priors were developed and have proven robust. I am not comfortable naively assuming they will generalize, and this is an incredibly high-stakes situation where far and away the only thing I care about is taking the actions that will actually, in practice, lead to a lower probability of extinction.
Regarding your examples: I'm completely ethically comfortable with someone forming a third political party in a country whose population contains two groups that both strongly want to commit genocide against the other. There are many ways such a third party could reduce the probability of genocide, even if its political base ultimately wants bad outcomes.
Another example is nuclear weapons. From a certain perspective, holding nuclear weapons is highly unethical as it risks nuclear winter, whether from provoking someone else or from a false alarm on your side. While I'm strongly in favour of countries unilaterally switching to a no-first-use policy and pursuing mutual disarmament, I am not in favour of countries unilaterally disarming themselves. By my interpretation of your proposed ethical rules, this suggests countries should unilaterally disarm. Do you agree with that? If not, what's disanalogous?
COVID-19 would be another example. Biology is not my area of expertise, but as I understand it, governments took actions that were probably good but risked some negative effects that could have made things worse. For example, widespread use of vaccines or antivirals, especially via the first-doses-first approach, plausibly made it more likely that resistant strains would spread, potentially affecting everyone else. In my opinion, these were clearly net-positive actions because the good done far outweighed the potential harm.
You could raise the objection that governments are democratically elected while Anthropic is not, but there were many other actors in these scenarios, like uranium miners, vaccine manufacturers, etc., who were also complicit.
Again, I'm purely defending the abstract point of "plans that could result in increased human extinction, even if by building the doomsday machine yourself, are not automatically ethically forbidden". You're welcome to critique Anthropic's actual actions as much as you like. But you seem to be making a much more general claim.