"I have heard that they get the details wrong though, and the fact that they [Groq] are still adversing their ResNet-50 performance (a 2015 era network) speaks to that."
I'm not sure I fully get this criticism: ResNet-50 is the most standard image recognition benchmark and unsurprisingly it's the only (?) architecture that NVIDIA lists in their benchmarking stats for image recognition as well: https://developer.nvidia.com/deep-learning-performance-training-inference.
You are of course aware that Xilinx has its own flavour of ML tooling that can be pushed onto its FPGAs. I believe it is mostly geared towards inference, but have you considered checking the plausibility of your 'as good as a 3090' estimate against the published performance numbers of the first-party solutions?
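As a first-pass version of that sanity check, one could simply compare peak datasheet throughput figures. The sketch below is a back-of-envelope helper, not a real benchmark; the TOPS values in it are illustrative placeholders that should be replaced with the actual numbers from NVIDIA's and Xilinx's datasheets before drawing any conclusion.

```python
# Back-of-envelope plausibility check: compare an accelerator's peak
# throughput against a GPU's. Peak TOPS ignores memory bandwidth,
# utilisation, and precision differences, so this only bounds the claim.

def throughput_ratio(accel_tops: float, gpu_tops: float) -> float:
    """Return the accelerator's peak throughput as a fraction of the GPU's."""
    if gpu_tops <= 0:
        raise ValueError("GPU throughput must be positive")
    return accel_tops / gpu_tops

if __name__ == "__main__":
    # Placeholder peak INT8 TOPS -- verify against vendor datasheets.
    GPU_INT8_TOPS = 280.0    # hypothetical placeholder for the 3090
    FPGA_INT8_TOPS = 130.0   # hypothetical placeholder for a Xilinx part

    ratio = throughput_ratio(FPGA_INT8_TOPS, GPU_INT8_TOPS)
    print(f"FPGA peak is {ratio:.0%} of the GPU's peak")
```

If the datasheet peak of the FPGA solution is only a fraction of the GPU's, an 'as good as a 3090' result would need a correspondingly large real-world utilisation advantage to be plausible.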