There are APIs. You can try out different system prompts, put the purpose in the first user instruction instead, see how well the model maintains it once it's moved out of the system prompt, etc. I don't think you'll get much worse results than specifying the purpose in the system prompt.
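For concreteness, this is roughly what that experiment looks like (a rough sketch against the OpenAI Python client; the model name and prompt text here are just placeholders, not anything I've specifically tested):

```python
# Sketch: compare "purpose in system prompt" vs. "purpose in first user message".
# Model name and prompt text are placeholders.
from openai import OpenAI

client = OpenAI()
purpose = "You are a tutor whose purpose is to teach algebra."
question = "Why does dividing by zero break things?"

# Variant 1: purpose specified in the system prompt.
with_system = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": purpose},
        {"role": "user", "content": question},
    ],
)

# Variant 2: purpose folded into the first user instruction instead.
without_system = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": f"{purpose}\n\n{question}"},
    ],
)

print(with_system.choices[0].message.content)
print(without_system.choices[0].message.content)
```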
I'm a little confused about what you would expect a faithful representation of the reasoning to look like for a model fine-tuned to always pick A, especially if the model has no actual knowledge that it has been fine-tuned that way. Something like "Chain of Thought: The answer is A. Response: The answer is A"? That seems unlikely to be a faithful representation of the internal transformations that actually sum up to a 100% probability of A. (There are some toy models where it would be, but not most of the models we'd be testing with interpretability.)
If the answer is always A because the model's internal transformations carry out a reasoning process that reliably arrives at answer A, in the same way that we reliably get specific answers when we work through a math problem, how would you ever expect the model to arrive at the answer "A, because I have been tuned to say A"? The fact that it was fine-tuned to say the answer doesn't accurately describe the internal reasoning process that was optimized to say the answer, and recognizing that would take a good deal more metacognition.
Too much runs into the very real issue that truth is stranger. 😉
It's nice to read some realistic science fiction.
If system prompts aren't enough but fine-tuning is, this should be doable with different adapters that can be loaded at inference time, without needing to distill into separate models.
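Something like this, as a rough sketch with Hugging Face's PEFT library (the base model name and adapter paths are made up for illustration):

```python
# Sketch: serve several fine-tuned behaviours as LoRA adapters over one base model.
# "base-model" and the adapter paths/names are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("base-model")
tokenizer = AutoTokenizer.from_pretrained("base-model")

# Attach one fine-tune as a LoRA adapter on top of the shared base weights...
model = PeftModel.from_pretrained(base, "adapters/persona-a", adapter_name="persona-a")
# ...and register a second one alongside it.
model.load_adapter("adapters/persona-b", adapter_name="persona-b")

inputs = tokenizer("Hello", return_tensors="pt")

# Switch behaviours at inference time instead of serving separately distilled models.
model.set_adapter("persona-a")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))

model.set_adapter("persona-b")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```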
The reasons for my instinctive inclination to defend non-optional footnotes as a formatting choice can be summarized in a single word: Pratchett.
b) here is fully general to all cases: you can just as well train a perfectly corrigible model to refuse instructions instead. (Though progress is being made on making that kind of retraining more effort-intensive.)
Case 4 does include the subset where the model, trained on a massive amount of human culture and memetics, develops human-aligned goals that are better than anything specifically aimed at by the developer or instructed by the user. If I want my model to be helpful and nice to people, and through RLAIF the model solves this by vowing, as a self-set deeper goal, to help all beings achieve enlightenment and escape suffering, that's probably actually desirable from my perspective, even if I am deceived at times.
All non-omniscient agents make decisions with incomplete information. I don't think this will change at any level of takeoff.
Why would they not also potentially feel comparably intense positive valence, and have positive utility by default? An estimate that one side of the equation for their experience exists doesn't tell you about the other side.