How familiar is the Lesswrong community as a whole with the concept of Reward-modelling? — LessWrong



The words don't ring a bell. You don't provide any explanation or reference, so I am unable to tell whether I am unfamiliar with the concept, or just know it under a different name (or no name at all).

Thank you so much for the reply. You prevented me from making a pretty big mistake.

I'm defining reward-modelling as the manipulation of the direction of an agent's intelligence, from a goal-directed perspective.

So the reward-modelling of an AI might be the weights used, its training environment, mesa-optimization structure, inner-alignment structure, etc.

Or for a human, it might be genetics, pleasure, and pain.

Is there a better word I can use for this concept? Or maybe I should just make up a word?

I approximately see the context of your question, but I am not sure what exactly you are talking about. Maybe try something less abstract and more ELI5, with specific examples of what you mean (and of the adjacent concepts that you don't mean)?

Is it about which forces direct an agent's attention in the short term? Like, a human would do X, because we have an instinct to do X, or because of a previous experience that doing X leads to pleasure, either immediately or in the longer term. And avoid Y, because of innate aversion, or a previous experience that Y causes pain.

Seems to me that "genetics" is a different level of abstraction than "pleasure and pain". If I try to disentangle this, it seems to me that humans:

immediately act on a stimulus (including internal, such as "I just remembered that...")
that is either a hardwired instinct, or learned i.e. a reaction stored in memory
the memory is updated by things causing pleasant or painful experiences (again, including internal experience, e.g. hearing something makes me feel bad, even if the stimulus itself is not painful)
both the instincts and the organization of memory are determined by the genes
which are formed by evolution.
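
As a toy illustration of the loop described in the list above (and not a claim about how brains actually work), here is a minimal Python sketch. Everything in it is an assumption made up for the example: the `ToyAgent` class, the `INSTINCTS` table, the stimulus and action names, and the simple update rule that nudges a remembered value toward the latest pleasant or painful feeling.

```python
# Hypothetical toy model; all names and numbers are illustrative assumptions.

# Hardwired stimulus -> reaction table (stands in for what the genes provide).
INSTINCTS = {"loud_noise": "flinch"}


class ToyAgent:
    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        # Learned reactions: (stimulus, action) -> remembered value.
        self.memory = {}

    def act(self, stimulus, actions):
        # A hardwired instinct fires immediately if one exists.
        if stimulus in INSTINCTS:
            return INSTINCTS[stimulus]
        # Otherwise pick the action with the best remembered value.
        return max(actions, key=lambda a: self.memory.get((stimulus, a), 0.0))

    def experience(self, stimulus, action, feeling):
        # feeling > 0 is pleasure, feeling < 0 is pain.
        # Move the remembered value part of the way toward the new feeling.
        old = self.memory.get((stimulus, action), 0.0)
        self.memory[(stimulus, action)] = old + self.learning_rate * (feeling - old)


agent = ToyAgent()
agent.experience("stove", "touch", -1.0)      # painful
agent.experience("stove", "keep_away", 0.2)   # mildly pleasant
choice = agent.act("stove", ["touch", "keep_away"])
```

In this sketch, the "where can we intervene" levels show up as separate things one could edit: the `INSTINCTS` table, the contents of `memory`, or the feelings passed to `experience`.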

Do you want a similar analysis for LLMs? Do you want to attempt to make a general analysis even for hypothetical AIs based on different principles?

Is the goal to know all the levels of "where we can intervene"? Something like: "we can train the AI, we can upvote or downvote its answers, we can directly edit its memory..."?

(I am not an expert on LLMs, so I can't tell you more than the previous paragraph contains. I am just trying to figure out what is the thing you are interested in. It seems to me that people already study the individual parts of that, but... are you looking for some kind of more general approach?)

These are 6 sample titles I'm considering using. Any thoughts come to mind?

AI-like reward functioning in humans. (Comprehensive model)
Agency in humans
Agency in humans | comprehensive model of why humans do what they do
EA should focus less on AI alignment, more on human alignment
EA's AI focus will be the end of us all.
EA's AI alignment focus will be the end of us all. We should focus on human alignment instead

I'd say that the 80/20 of the concept is how reward & punishment affect human behavior.

Is it about which forces?
- I would say I'm referring to a combination of instinct, innate attraction/aversion, previous experience, decision-making, attention, and how they relate to each other in an everyday practical context.

Seems to me that "genetics"
- I would say your disentanglement is right on the money. Rather than making an analysis for LLMs, I'm particularly interested in fleshing out the interrelations between concepts as they relate to the human brain.

Do you want a similar analysis for LLMs?
- I mean it from a high-level agency perspective, as opposed to in specific AI or machine learning contexts.

Goal?
- My goal is to learn more about what information Lesswrongers use and are interested in so that I can better create a post for the community.

Adjacent concepts

Self-discipline
Positive psychology
Systems & patterns thinking
Maybe reward-functions?

faul_sname (1y ago)

Can you give one extremely concrete example of a scenario which involves reward modeling, and point to the part of the scenario that you call "reward modeling"?

It should be a different word, to avoid confusion with reward models (standard terminology for models used to predict the reward in some ML contexts).
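
For readers unfamiliar with the ML usage, here is a minimal sketch of what "a model used to predict the reward" can look like. It is a toy under stated assumptions, not how any particular system does it: the two features (`is_helpful`, `is_rude`), the four labelled examples, and the function names are all invented for illustration, and the model is just a linear fit trained by plain gradient descent.

```python
# Hypothetical toy reward model; the data and names are made up for illustration.

def predict_reward(w, b, features):
    # Linear prediction: weighted sum of the features plus a bias.
    return sum(wi * xi for wi, xi in zip(w, features)) + b


def fit_reward_model(examples, steps=2000, lr=0.1):
    # Fit w and b by gradient descent on the mean squared prediction error.
    n_features = len(examples[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, reward in examples:
            error = predict_reward(w, b, features) - reward
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        w = [wi - lr * g / len(examples) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(examples)
    return w, b


# Each example: ([is_helpful, is_rude], reward assigned by a hypothetical rater).
examples = [
    ([1, 0], 1.0),
    ([0, 1], -1.0),
    ([1, 1], 0.0),
    ([0, 0], 0.0),
]
w, b = fit_reward_model(examples)
predicted = predict_reward(w, b, [1, 0])  # should be close to 1.0
```

The point of the sketch is only that a "reward model" in this sense is a learned predictor of a reward signal, which is narrower than the broad notion discussed earlier in the thread.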

Thanks for this. Do you have any ideas about what terminology I should use if I mean models used to predict reward in human contexts?