they abandoned simple metrics in favour of analyses in which qualitative factors play a large role, because all the metrics they evaluated failed to have good properties
Do you have more specific statements from GiveWell about this shift? I have not been able to find a clear argument for your claim on their website or on the EA Forum.
Also, views on what counts as a well-behaved utility function may vary. What you need is an approximation of ideal utilitarianism: a consistent ordering of world-states by total happiness or suffering (depending on flavor), plus some idea of how to get to the better ones. I think we can coordinate on approximations that are good enough to guide giving. Is that well-behaved enough, or are you pointing at something stronger here?