Interesting post! Do you have papers supporting the claims about why mixed activation functions perform worse? This is something I have thought about a little but not looked into deeply, so I would appreciate links. My naive thinking is that it mostly doesn't work because of the difficulty of keeping the network well-conditioned and the loss landscape smooth and low-curvature when a layer contains different activation functions. With a single activation function, it is relatively straightforward to design an initialization that doesn't blow up -- with mixed ones, it seems the space of potential numerical difficulties increases massively.
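To make the initialization point concrete, here is a rough sketch (my own illustration, not taken from the post or any paper). It assumes unit-Gaussian pre-activations and estimates, by Monte Carlo, the gain each activation would need to keep the second moment of its output at 1, in the spirit of He-style initialization. The function names and the choice of ReLU, tanh, and Swish are just assumptions for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def swish(z):
    # Swish / SiLU: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

# Hypothetical set of activations that might share one layer.
ACTIVATIONS = {"relu": relu, "tanh": np.tanh, "swish": swish}

def second_moment(f, n_samples=1_000_000, seed=0):
    """Monte Carlo estimate of E[f(z)^2] for z ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_samples)
    return float(np.mean(f(z) ** 2))

def init_gain(f, **kwargs):
    """Scale factor that restores a unit second moment after applying f."""
    return 1.0 / np.sqrt(second_moment(f, **kwargs))

gains = {name: init_gain(f) for name, f in ACTIVATIONS.items()}

for name, gain in gains.items():
    print(f"{name}: gain ~ {gain:.3f}")
```

For ReLU this recovers the familiar gain of about sqrt(2). The other activations come out different, so a layer that mixes them can't use one global weight scale: each unit would need a gain matched to its own activation, and that is before considering how the gradients behave.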

bhauth, 3y

"papers for the claims on why mixed activation functions perform worse"

No, there are no papers on that topic that I know of. There are relatively few papers that work on mixed activation functions at all. You should understand that papers that don't show at least a marginal increase on some niche benchmark tend not to get published. So, much of the work on mixed activation functions went unpublished.

But I can link to papers on testing mixed activation functions. Here's a Bachelor's thesis from 2022 that did relatively extensive testing. They did evolution of activation function sets for a particular application and got slightly better performance than ReLU/Swish.

That's an unfair comparison to ReLU/Swish, because adapting activation functions to a particular task can improve performance. The thesis also ran its evolutionary search on single functions, and that approach did about as well as the mixed functions.

So far so good, but then, when the network was scaled up from VGG-HE-2 to VGG-HE-4, their evolved activation sets all got worse, while ReLU and Swish got better. Their best mixed activation set went from 80% to 10% accuracy as the network was scaled up, while the evolved single functions held up better but all became worse than Swish.

One of the issues I mentioned with mixed activation functions is specific to SGD training; there's also been some work on using them with neuroevolution.

RussellThor, 3y

Is this generalizable enough to have anything to say on slow vs fast takeoff? For example, can you show that you will need a massive net to develop new understandings or to get more accuracy on existing tasks?