An oversimplified picture of a reinforcement-learning agent (in particular, roughly a Q-learning agent with a single state) could be as follows. A program has two numerical variables: go_left and go_right. The agent chooses to go left or right based on which of these variables is larger. Suppose that go_left is 3 and go_right is 1. The agent goes left. The environment delivers a "reward" of -4. Now go_left gets updated to 3 - 4 = -1 (which is not quite the right math for Q-learning, but ok). So now go_right > go_left, and the agent goes right.
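As a rough illustration, the toy agent above can be sketched in a few lines of Python. The function names, the learning rate of 0.5, and the choice to ignore future rewards (a discount of 0) are assumptions made for this sketch only. The first update rule is the simplified "add the reward" arithmetic from the example; the second is closer to what a tabular Q-learning step looks like in the single-state case.

```python
def choose(values):
    # Greedy choice: pick the action whose stored value is largest.
    return max(values, key=values.get)


def naive_update(values, action, reward):
    # The simplified rule from the example: add the reward to the stored value.
    values[action] += reward


def q_update(values, action, reward, alpha=0.5):
    # Single-state Q-learning-style step, assuming a discount of 0
    # (future rewards ignored): move the stored value a fraction
    # alpha of the way toward the observed reward.
    values[action] += alpha * (reward - values[action])


# Simplified arithmetic: go_left becomes 3 - 4 = -1.
values = {"go_left": 3.0, "go_right": 1.0}
first_action = choose(values)
naive_update(values, first_action, -4.0)
second_action = choose(values)

# Q-learning-style arithmetic: go_left becomes 3 + 0.5 * (-4 - 3) = -0.5.
q_values = {"go_left": 3.0, "go_right": 1.0}
q_update(q_values, choose(q_values), -4.0)
q_second_action = choose(q_values)

print(first_action, values, second_action)
print(q_values, q_second_action)
```

With these particular numbers both rules end up flipping the choice from left to right; they differ only in how far a single reward moves the stored value.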
So what you said is exactly correct: "It is just physics. What we call 'reward' and 'punishment' are just elements of a program forcing an agent to do something". And I think our animal brains do the same thing: they receive rewards that update our inclinations to take various actions. However, animal brains have lots of additional machinery that simple RL agents lack. The actions we take are influenced by a number of cognitive processes, not just the basic RL machinery. For example, if we were just following RL mechanically, we might keep eating candy for a long time without stopping, but our brains are also capable of influencing our behavior via intellectual considerations like "Too much candy is bad for my health". It's possible these intellectual thoughts lead to their own "rewards" and "punishments" that get applied to our decisions, but at least it's clear that animal brains make choices in very complicated ways compared with barebones RL programs.
You wrote: "Sentient beings do because they feel pain and pleasure. They have no choice but to care about punishment and reward." The way I imagine it (which could be wrong) is that animals are built with RL machinery (along with many other cognitive mechanisms) and are mechanically driven to care about their rewards in much the same way that a computer program is. They also have cognitive processes for interpreting what's happening to them, and this interpretive machinery labels some incoming sensations as "good" and some as "bad". If we ask ourselves why we care about not staying outside in freezing temperatures without a coat, we say "I care because being cold feels bad". That's a folk-psychology way to say "My RL machinery cares because being outside in the cold sends rewards of -5 at each time step, and taking the action of going inside changes the rewards to +1. And I have other cognitive machinery that can interpret these -5 and +1 signals as pain and pleasure and understand that they drive my behavior."
Assuming this account is correct, the main distinction between simple programs and ourselves is one of complexity -- how much additional cognitive machinery there is to influence decisions and interpret what's going on. That's the reason I argue that simple RL agents have a tiny bit of moral weight. The difference between them and us is one of degree.
It seems to me that there must be more to pain and pleasure than mere -1 and +1 signals, because there are multiple ways to make a behavior more or less likely. Pain and pleasure are one such option, habits are another, and unconscious biases are yet another. Each of them makes some behaviors more likely and others less likely, but each feels quite different from the inside. Compared to habits and unconscious biases, pain and pleasure have some extra quality because of how they are implemented in our bodies.
Simple RL agents, unless they have specific circuits for feeling pain and pleasure, are in my opinion more analogous to habits or unconscious biases.