I think the experiments are more likely to work the way you predict if the agent has only partial observability, meaning its state is just the 5x5 grid around it. Of course, you would then need to give the agent some form of memory, such as an LSTM, so it can remember where it has already been.
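To make the 5x5 observation concrete, here is a minimal sketch of how the window could be cropped from the full grid. The list-of-lists grid, the cell symbols ("K" key, "C" chest, "A" agent, "#" padding) and the name `local_view` are all my own assumptions for illustration, not anything taken from your setup.

```python
def local_view(grid, row, col, radius=2, pad="#"):
    """Return the (2*radius+1)-square window centered on (row, col).

    Cells that fall outside the grid are filled with `pad`, so the window
    always has the same shape, even when the agent is next to a border.
    """
    height, width = len(grid), len(grid[0])
    view = []
    for r in range(row - radius, row + radius + 1):
        line = []
        for c in range(col - radius, col + radius + 1):
            inside = 0 <= r < height and 0 <= c < width
            line.append(grid[r][c] if inside else pad)
        view.append(line)
    return view


# Hypothetical 5x6 layout: a key in the top-left corner, a chest bottom-right.
grid = [
    list("K....."),
    list("......"),
    list("..A..."),
    list("......"),
    list(".....C"),
]
view = local_view(grid, 2, 2)  # default radius=2 gives a 5x5 window
```

With the agent at (2, 2), the key is inside the window but the chest in the last column is not, which is exactly the kind of information the agent would need memory to keep track of.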
If the agent can see the full environment, it is easier for it to discover the optimal policy of going to the nearest key first and then to the nearest chest. An agent that implements this policy will still maximize the true reward in the test environment.
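As a rough sketch of the "nearest key, then nearest chest" policy under full observability, something like the following would do. It assumes positions are (row, col) tuples and that distance is Manhattan distance with no walls in the way; the function names are hypothetical.

```python
def manhattan(a, b):
    """Manhattan distance between two (row, col) positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def full_view_target(agent, keys, chests, keys_held):
    """Pick the next cell to walk to when the whole grid is visible.

    With no key in hand the agent heads for the nearest key; once it
    holds a key it heads for the nearest chest. Returns None when there
    is nothing left of the kind it is looking for.
    """
    candidates = chests if keys_held > 0 else keys
    if not candidates:
        return None
    return min(candidates, key=lambda pos: manhattan(agent, pos))


# Hypothetical layout with two keys and two chests.
keys = [(0, 3), (4, 4)]
chests = [(1, 0), (5, 5)]
first_target = full_view_target((0, 0), keys, chests, keys_held=0)
second_target = full_view_target((0, 3), keys, chests, keys_held=1)
```

Note that this rule never goes after a second key while it is still holding one, so extra keys in the test environment don't distract it.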
However, if the agent can only see a 5x5 grid around it, it will have to explore to find keys and chests. In the training environment, the optimal policy will be to explore, pick up any key it sees, and open any chest it sees while it is holding a key. I'm assuming the training environment is set up so that the agent can't reach 2 keys without seeing a chest along the way. Under that assumption, the policy of always picking up a visible key works great during training: by the time the agent reaches another key, it will already have used the first one to open the chest it saw.
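Here is a hand-written version of the rule I have in mind, just to pin down what I mean. It is only a sketch: the cell symbols ("K" for key, "C" for chest), the action format, and the choice to prefer a visible chest over a visible key when the agent already holds a key are all assumptions on my part, and a trained agent would not necessarily learn exactly this.

```python
def partial_view_action(view, keys_held):
    """Choose what to do from a square egocentric window.

    Returns ("goto", (d_row, d_col)) with an offset relative to the
    agent at the center of the window, or ("explore", None) when
    nothing useful is visible.
    """
    center = len(view) // 2
    keys, chests = [], []
    for r, line in enumerate(view):
        for c, cell in enumerate(line):
            offset = (r - center, c - center)
            if cell == "K":
                keys.append(offset)
            elif cell == "C":
                chests.append(offset)

    def nearest(offsets):
        # Closest offset by Manhattan distance from the agent.
        return min(offsets, key=lambda o: abs(o[0]) + abs(o[1]))

    # Assumption: a visible chest takes priority once a key is in hand.
    if keys_held > 0 and chests:
        return ("goto", nearest(chests))
    # Otherwise always go for a visible key, even if already holding one.
    if keys:
        return ("goto", nearest(keys))
    return ("explore", None)


# Hypothetical window: key in the top-left corner, chest bottom-right.
window = [list(".....") for _ in range(5)]
window[0][0] = "K"
window[4][4] = "C"
without_key = partial_view_action(window, keys_held=0)
with_key = partial_view_action(window, keys_held=1)
```

The second branch is the important one: when no chest is in view, the rule walks to a visible key no matter how many keys it already holds.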
Then in the test environment, where there are a lot of keys, the agent will probably keep picking up the keys it can see and not spend time looking for chests to open. But I'm guessing the agent will still open a chest if it sees one while it's picking up keys.
I think it would be interesting to get results for both the full observability and partial observability cases.