I have a guesstimate for the number of parameters, but not for overall compute or dollar cost:
Each agent was trained on 8 TPUv3s, which cost about $5,000/mo according to a quick Google search, and which seem to deliver about 90 TOPS, i.e. roughly 10^14 operations per second. They say each agent does about 50,000 steps per second, which works out to about 2 billion operations per step (10^14 / 50,000). Each little game they play lasts 900 steps if I recall correctly, which they say is about 2 minutes of subjective time (I imagine they extrapolated from running the game at a speed where the physics simulation looks normal-speed to us). That means about 7.5 steps per subjective second (900 / 120), so each agent requires about 15 billion operations per subjective second.
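For anyone who wants to check the arithmetic, here is a quick Python sketch of it. The variable names are mine, and it bakes in the same assumptions as the paragraph above: that the rounded ~10^14 ops/sec figure is what one agent actually gets, and that all of it goes into environment steps. Both are probably generous.

```python
# Back-of-envelope inputs, all taken from the estimate above (not measured).
ops_per_second = 1e14          # ~90 TOPS rounded up to 10^14 (assumption)
steps_per_second = 50_000      # reported agent steps per second
steps_per_episode = 900        # recalled episode length in steps
subjective_seconds = 2 * 60    # ~2 minutes of subjective time per episode

# Operations spent on each step, assuming full utilization (assumption).
ops_per_step = ops_per_second / steps_per_second

# How many steps correspond to one subjective second.
steps_per_subjective_second = steps_per_episode / subjective_seconds

# Operations per subjective second for one agent.
ops_per_subjective_second = ops_per_step * steps_per_subjective_second

print(f"ops per step:               {ops_per_step:.2e}")
print(f"steps per subjective sec:   {steps_per_subjective_second:.1f}")
print(f"ops per subjective second:  {ops_per_subjective_second:.2e}")
```

If real utilization is well below peak, the per-step figure shrinks proportionally, so these numbers are best read as upper bounds.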
So... 2 billion operations per step suggests that these things are about the size of GPT-2 (roughly 1.5 billion parameters), i.e. about the size of a rat brain? If we care about subjective time, then it seems the human brain maybe uses 10^15 FLOP per subjective second, which is about 5 OOMs more than these agents' ~1.5 × 10^10.
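As a sanity check on the "about 5 OOMs" figure, here is a tiny Python sketch. The names are mine, and both inputs are the rough guesses from this comment, not established numbers.

```python
import math

# Rough guesses from the estimate above (assumptions, not measurements).
brain_flop_per_subjective_second = 1e15    # guessed human-brain figure
agent_ops_per_subjective_second = 1.5e10   # ~15 billion, from the step arithmetic

# Gap between the two, expressed in orders of magnitude (base-10 log of the ratio).
ratio = brain_flop_per_subjective_second / agent_ops_per_subjective_second
ooms = math.log10(ratio)

print(f"ratio: {ratio:.1e}, i.e. about {ooms:.1f} orders of magnitude")
```

This treats "operations" and "FLOP" as interchangeable, which is loose but fine at this level of precision.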