I think our current best implementation of IDA would neither be competitive nor scalably aligned :)
In most cases you can continuously trade off performance and cost; for that reason I usually think of them as a single metric of "competitive with X% overhead." I agree there are cases where they come apart, but I think there are pretty few examples. (Even for nuclear weapons you could ask "how much more expensive is it to run a similarly-destructive bombing campaign with conventional explosives.")
I think this works best if you consider a sequence of increments each worth +10%, rather than say accumulating 70 of those increments, because "spend 1000x more" is normally not available and so we don't have a useful handle on what a technology looks like when scaled up 1000x (and that scaleup would usually involve a bunch of changes that are hard to anticipate).
That is, if we have a sequence of technologies A0, A1, A2, ..., AN, each of which is 10% cheaper than the one before, then we may say that AN is better than A0 by N 10% steps (rather than trying to directly evaluate how many orders of magnitude you'd have to spend on A0 to compete with AN, because the process "spend a thousand times more on A0 in a not-stupid way" is actually kind of hard to imagine).
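The correspondence between "~70 increments of +10%" and "spend 1000x more" is just compounding. A quick sketch of the arithmetic (my own illustration, assuming each increment is exactly +10%):

```python
import math

def steps_for_factor(factor, step=1.10):
    """Number of compounding 10%-cheaper increments needed to span
    an overall cost ratio of `factor` (solve step**n == factor)."""
    return math.log(factor) / math.log(step)

# A 1000x cost gap corresponds to roughly 70 such increments.
print(steps_for_factor(1000))  # ~72.5
```

So evaluating the gap as a count of modest increments and evaluating it as one giant scaleup are mathematically interchangeable; the point in the comment is that only the incremental version is something we can actually reason about.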
IDA is really aiming to be cost-competitive and performance-competitive, say to within 10% overhead. That may or may not be possible, but it's the goal.
If the compute required to build and run your reward function is small relative to the compute required to train your model, then it seems like overhead is small. If you can do semi-supervised RL and only require a reward function evaluation on a minority of trajectories (e.g. because most of the work is learning about how to manipulate the environment), then you can be OK as long as the cost of running the reward function isn't too much higher.
Whether that's possible is a big open question. Whether it's date-competitive depends on how fast you figure out how to do it.
I think "makes 50% of currently-skeptical people change their minds" is a high bar for a warning shot. On that definition e.g. COVID-19 will probably not be a warning shot for existential risk from pandemics. I do think it is plausible that AI warning shots won't be much better than pandemic warning shots. (On your definition it seems likely that there won't ever again be a warning shot for any existential risk.)
For a more normal bar, I expect plenty of AI systems to fail at large scales in ways that seem like "malice," and then to cover up the fact that they've failed. AI employees will embezzle funds, AI assistants will threaten and manipulate their users, AI soldiers will desert. Events like this will make it clear to most people that there is a serious problem, and plenty of people will be working on it simply in order to make AI useful. The base rate will remain low but there will be periodic high-profile blow-ups.
I don't expect the kind of total unity of AI motivations that you are imagining, where all of them want to take over the world (so that the only case where you see something frightening is a failed bid to take over the world). That seems pretty unlikely to me, though it's conceivable (maybe 10-20%?) and may be an important risk scenario. I think it's much more likely that we stamp out all of the other failures gradually, and are left with only the patient+treacherous failures, and in that case whether there's a warning shot or not depends entirely on how much people are willing to generalize.
I do think the situation in the AI community will be radically different after observing these kinds of warning shots, even if we don't observe an AI literally taking over a country.
There is a very narrow range of AI capability between "too stupid to do significant damage of the sort that would scare people" and "too smart to fail at takeover if it tried."
Why do you think this is true? Do you think it's true of humans? I think it's plausible if you require "take over a country" but not if you require e.g. "kill plenty of people" or "scare people who hear about it a lot."
(This is all focused on intent alignment warning shots. I expect there will also be other scary consequences of AI that get people's attention, but the argument in your post seemed to be just about intent alignment failures.)
Disclaimer: I don't know if this is right, I'm reasoning entirely from first principles.
If there is dispersion in R0, then there would likely be some places where the virus survives even if you take draconian measures. If you later relax those draconian measures, it will begin spreading in the larger population again at the same rate as before.
In particular, if the number of cases is currently decreasing overall most places, then soon most of the cases will be in regions or communities where containment was less successful and so the number of cases will stop decreasing.
If it's infeasible to literally stamp it out everywhere (which I've heard), then you basically want to either delay long enough to have a vaccine or have people get sick at the largest rate that the health care system can handle.
The intuitive idea is to share activations as well as weights, i.e. to have two heads (or more realistically one head consulted twice) on top of the same model. There is a fair amount of uncertainty about this kind of "detail" but I think for now it's smaller than the fundamental uncertainty about whether anything in this vague direction will work.
It's an interesting coincidence that arbitration is the strongest thing we can falsify, and also apparently the strongest thing that can consistently apply to itself (if we allow probabilistic arbitration). Maybe not a coincidence?
It's not obvious to me that "consistent with PA" is the right standard for falsification though. It seems like simplicity considerations might lead you to adopt a stronger theory, and that this might allow for some weaker probabilistic version of falsification for things beyond arbitration. After all, how did we get induction anyway?
(Do we need induction, or could we think of falsification as being relative to some weaker theory?)
(Maybe this is just advocating for epistemic norms other than falsification though. It seems like the above move would be analogous to saying: the hypothesis that X is a halting oracle is really simple and explains the data, so we'll go with it even though it's not falsifiable.)
tl;dr: seems like you need some story for what values a group highly regards / rewards. If those are just the values that serve the group, this doesn't sound very distinct from "groups try to enforce norms which benefit the group, e.g. public goods provision" + "those norms are partially successful, though people additionally misrepresent the extent to which they e.g. contribute to public goods."
Similarly, larger countries do not have higher ODA as the public goods model predicts
Calling this the "public goods model" still seems backwards. "Larger countries have higher ODA" is a prediction of "the point of ODA is to satisfy the donor's consequentialist altruistic preferences."
The "public goods model" is an attempt to model the kind of moral norms / rhetoric / pressures / etc. that seem non-consequentialist. It suggests that such norms function in part to coordinate the provision of public goods, rather than as a direct expression of individual altruistic preferences. (Individual altruistic preferences will sometimes be why something is a public good.)
This system probably evolved to "solve" local problems like local public goods and fairness within the local community, but has been co-opted by larger-scale moral memeplexes.
I agree that there are likely to be failures of this system (viewed teleologically as a mechanism for public goods provision or conflict resolution) and that "moral norms are reliably oriented towards providing public goods" is a less accurate description than "moral norms are vaguely oriented towards providing public goods." Overall the situation seems similar to a teleological view of humans.
For example if global anti-poverty suddenly becomes much more cost effective, one doesn't vote or donate to spend more on global poverty, because the budget allocated to that faction hasn't changed.
I agree with this, but it seems orthogonal to the "public goods model"; this is just about how people or groups aggregate across different values. I think it's pretty obvious in the case of imperfectly-coordinated groups (who can't make commitments to have their resource shares change as beliefs about relative efficacy change), and I think it also seems right in the case of imperfectly-internally-coordinated people.
(We have preference alteration because preference falsification is cognitively costly, and we have preference falsification because preference alteration is costly in terms of physical resources.)
Relevant links: "If we can't lie to others, we will lie to ourselves," "The monkey and the machine."
E.g., people overcompensate for private deviations from moral norms by putting lots of effort into public signaling including punishing norm violators and non-punishers, causing even more preference alteration and falsification by others.
I don't immediately see why this would be "compensation," it seems like public signaling of virtue would always be a good idea regardless of your private behavior. Indeed, it probably becomes a better idea as your private behavior is more virtuous (in economics you'd only call the behavior "signaling" to the extent that this is true).
As a general point, I think calling this "signaling" is kind of misleading. For example, when I follow the law, in part I'm "signaling" that I'm law-abiding, but to a significant extent I'm also just responding to incentives to follow the law which are imposed because other people want me to follow the law. That kind of thing is not normally called signaling. I think many of the places you are currently saying "virtue signaling" have significant non-signaling components.
That reminds me that another prediction your model makes is that larger countries should spend more on ODA (which BTW excludes military aid), but this is false
The consideration in this post would help explain why smaller countries spend more than you would expect on a naive view (where ODA just satisfies the impartial preferences of the voting population in a simple consequentialist way). It seems like there is some confusion here, but I still don't feel like it's very important.
I think there was an (additional?) earlier miscommunication or error regarding the "factions within someone's brain":