Interesting. Can you give us a sense of how much those asks (offer to pay for the extra labour) end up costing you?
I always assumed that "Why don't we give Terence Tao a million dollars to work on AGI alignment?" was using Tao to refer to a class of people. Your comment implies that it would be especially valuable for Tao specifically to work on it.
Why should we believe that Tao would be especially likely to make progress on AGI alignment (e.g. compared to other recent Fields Medal winners like Peter Scholze)?
[I think this is more anthropomorphizing ramble than concise argument. Feel free to ignore :) ]
I get the impression that in this example the AGI would not actually be satisficing: it is no longer maximizing a goal, but it is still optimizing for this rule.
For a satisficing AGI, I'd imagine something vague like "Get many paperclips" resulting in the AGI trying to get paperclips but at some point (an inflection point of diminishing marginal returns? some point where it becomes very uncertain about what the next action should be?) doing something else.
Or for rules like "get 100 paperclips, not more" the AGI might adhere only directionally or opportunistically. Within the rule, this might look like "I wanted to get 100 paperclips, but 98 paperclips are still better than 90, let's move on" or "Oops, I accidentally got 101 paperclips. Too bad, let's move on".
In your example of the AGI taking lots of precautions, a satisficing AGI would not do this, because it could be spending its time doing something else.
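The stopping rule gestured at above (quit once marginal returns diminish) could be sketched as a toy. Everything here is hypothetical illustration, not anything from the original comment: the utility function, the threshold, and the names are all made up.

```python
# Toy sketch of a satisficer: it pursues a goal only while the marginal
# value of the next step stays above a threshold, then moves on,
# instead of maximizing forever.

def marginal_value(n: int) -> float:
    # Assumed toy utility with diminishing returns: the (n+1)-th
    # paperclip is worth 1/(n+1).
    return 1.0 / (n + 1)

def satisfice(threshold: float = 0.02, max_steps: int = 1000) -> int:
    paperclips = 0
    for _ in range(max_steps):
        if marginal_value(paperclips) < threshold:
            break  # marginal value too small: do something else instead
        paperclips += 1
    return paperclips

print(satisfice())  # stops once 1/(n+1) < 0.02, i.e. after 50 paperclips
```

The contrast with the "get 100, not more" rule is that nothing here is optimized to hit a target exactly; the agent just stops caring once further effort is barely worth it.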
I suspect there are major flaws with it, but an intuition I have goes something like this:
There has been quite a lot of discussion over on the EA Forum:
Avital Balwit linked to this LessWrong post in the comments of her own response to his longtermism critique (because Phil Torres is currently banned from the forum, afaik):
The whole thing was much more banal than what you're imagining. It was an interim-use building with mainly student residents. There was no coordination between residents that I knew of.
The garden wasn't trashed before the letter. It was just a table and a couple of chairs that didn't fit the house rules. If the city had just said "please take the table out of the garden", I'd have given it a 70% chance of working. If the city had said nothing, there would not have been (a lot of) additional furniture in the garden.
By issuing the threat, the city introduced an incentive they didn't intend.
Some residents who picked up on the incentive destroyed the garden because they were overconfident in the authority following through with the threat – no matter what.
How useful do people think it would be to have human-in-the-loop (HITL) AI systems?
What's the likeliest failure mode of HITL AI systems? And what kinds of policies are most likely to help against those failure modes?
If we assume that HITL AI systems are not going to be safer once we reach unaligned AGI, could they maybe give us tools/information/time-steps to increase the chance that AGI will be aligned?