I don't have much of an answer for you, but I wanted to explicitly thank you for posting this thread. I'm in a similar situation and wouldn't have thought to ask here, though I should have.
[Hi! I've been lurking for a long time, and this seems like as good a reason as any to actually put something out there. Epistemic status: low confidence, but it seems low-risk, high-reward to try. This is not intended to be a full list, since I don't have the expertise for that; I'm just posting any ideas I have that I don't already see here. These probably already exist and I just don't know the names.]
1) Input masking. For an oracle/task AI, you ask the AI for a program that solves a slightly more general version of your problem, and you don't give it the information necessary to narrow that down to your specific case. You then run the program on your actual case (plus, probably, some simple test cases you already know the answers to, to make sure it solves the problem). This lets you penalize the AI for the complexity of the output program, so it should give you something narrow instead of a general reasoner. (Obviously you still have to be sensible about the output program: don't go post the code to GitHub or give it internet access.)
2) Reward function stability. We know we might have made mistakes when inputting the reward function, but we have some example test cases we're confident in. Tell the AI to look for a bunch of different possible functions that give the same output as the existing reward function on those test cases, and filter out any potential action that any of those functions sees as harmful.
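To make idea 1 a bit more concrete, here is a toy sketch of the selection step only. Everything in it is an assumption for illustration: the candidate programs are hand-written stand-ins for what an AI might return, `pick_program` is a hypothetical helper, and source length is a very crude proxy for program complexity.

```python
from typing import Callable, Sequence, Tuple

# A candidate is (source text, callable). Both are hypothetical stand-ins
# for programs an AI might return for the *generalized* problem.
Program = Tuple[str, Callable[[int], int]]


def pick_program(
    candidates: Sequence[Program],
    test_cases: Sequence[Tuple[int, int]],
) -> Program:
    """Return the shortest candidate that passes every known test case."""
    passing = [
        (src, fn)
        for src, fn in candidates
        if all(fn(x) == expected for x, expected in test_cases)
    ]
    if not passing:
        raise ValueError("no candidate passed the known test cases")
    # Crude complexity penalty (an assumption): prefer the shortest source.
    return min(passing, key=lambda program: len(program[0]))


# Generalized task given to the AI: "square an integer".
candidates = [
    ("lambda x: x + x", lambda x: x + x),  # wrong in general
    ("lambda x: sum(abs(x) for _ in range(abs(x)))",
     lambda x: sum(abs(x) for _ in range(abs(x)))),  # right but longer
    ("lambda x: x * x", lambda x: x * x),  # right and short
]

# Simple cases we already know the answers to.
test_cases = [(2, 4), (3, 9), (-1, 1)]

chosen_src, chosen_fn = pick_program(candidates, test_cases)

# The actual case is only supplied here, after selection; the AI never saw it.
actual_input = 12
result = chosen_fn(actual_input)
```

The point of the sketch is just the ordering: the specific input only shows up after a program has been chosen on generic grounds. Running the chosen program in a sandbox is still necessary and is not shown here.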
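Idea 2 can be sketched in a similarly toy way. The assumptions here are mine: actions are plain floats, "harmful" means a reward below a threshold, and the alternative reward functions are hand-written rather than found by an AI. `consistent_rewards` and `safe_actions` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple

RewardFn = Callable[[float], float]


def consistent_rewards(
    candidates: Sequence[RewardFn],
    trusted_cases: Sequence[Tuple[float, float]],
    tol: float = 1e-9,
) -> List[RewardFn]:
    """Keep candidates that match the trusted (action, reward) pairs."""
    return [
        reward
        for reward in candidates
        if all(abs(reward(a) - value) <= tol for a, value in trusted_cases)
    ]


def safe_actions(
    actions: Sequence[float],
    rewards: Sequence[RewardFn],
    harm_threshold: float = 0.0,
) -> List[float]:
    """Keep actions that no plausible reward function scores as harmful."""
    return [
        a for a in actions
        if all(reward(a) >= harm_threshold for reward in rewards)
    ]


def existing_reward(a: float) -> float:
    """The reward function we wrote down, which may contain mistakes."""
    return a


# Hand-written alternatives (assumption: an AI would propose these).
candidates = [
    existing_reward,
    lambda a: a * a,        # agrees on the trusted cases
    lambda a: a * (2 - a),  # agrees on the trusted cases, differs elsewhere
    lambda a: -a,           # disagrees on the trusted cases, gets dropped
]

# Test cases we are confident in: (action, reward).
trusted_cases = [(0.0, 0.0), (1.0, 1.0)]

plausible = consistent_rewards(candidates, trusted_cases)
allowed = safe_actions([0.5, 1.0, 3.0, -2.0], plausible)
```

In this toy setup the action `3.0` looks good under the existing reward but is rejected because one function that agrees with it on the trusted cases scores it as harmful, which is the behavior the idea is after.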