Reinforcement learning

Edited by pedrochaves, TurnTrout, Roman Leventov, et al. Last updated 30th Dec 2024.

Reinforcement Learning is the study of how to train agents to complete tasks by updating ("reinforcing") the agents with feedback signals.

Consider an agent that receives an input describing the state of its environment. Based only on that information, the agent must choose one action from a set of available actions. The action changes the state of the environment, which produces a new input, and so on. At each step the agent also receives a reward (or reinforcement signal) that reflects how well its actions are working in the environment. In "policy gradient" approaches, the reinforcement signal is often used to update the agent (the "policy") directly, although sometimes an agent will instead do limited online (model-based) heuristic search to optimize the reward signal plus a heuristic evaluation.
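The interaction loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CorridorEnv`, `random_policy`, and `run_episode` are hypothetical names invented for this example, and the corridor task itself is a toy that does not come from any particular library.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a corridor whose last cell is the goal."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        # Start every episode at the left end of the corridor.
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, any other action moves left (clipped at the walls).
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.length - 1, self.state + move))
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # reward only when the goal is reached
        return self.state, reward, done


def random_policy(state, rng):
    # Ignores the state and picks an action uniformly at random.
    return rng.choice([0, 1])


def run_episode(env, policy, rng, max_steps=1000):
    """Run one episode: observe the state, act, receive a reward, repeat."""
    state = env.reset()
    total_reward = 0.0
    steps_taken = 0
    for _ in range(max_steps):
        action = policy(state, rng)
        state, reward, done = env.step(action)
        total_reward += reward
        steps_taken += 1
        if done:
            break
    return total_reward, steps_taken
```

Calling `run_episode(CorridorEnv(), random_policy, random.Random(0))` runs a single episode with an agent that never learns. A learning agent would additionally use the observed rewards to change its policy between or during episodes.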

RL is distinguished from energy-based architectures such as Active Inference, Joint Embedding Predictive Architectures (JEPA), and GFlowNets.

Exploration and Optimization

Selecting actions at random results in poor performance, so one of the biggest problems in reinforcement learning is how to explore the available set of actions: the agent must try enough alternatives to avoid getting stuck in sub-optimal choices and move on to better ones.

This is the problem of exploration, which is best illustrated by one of the most studied reinforcement learning problems: the k-armed bandit. In it, an agent has to decide which sequence of levers to pull in a room of slot machines, with no information about each machine's probability of winning other than the reward it receives after each pull. The problem revolves around deciding which lever is optimal and which criterion defines it as such.
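One common way to trade off exploration against exploitation in the k-armed bandit is the epsilon-greedy rule: with a small probability the agent pulls a random lever, and otherwise it pulls the lever with the highest estimated payoff so far. The sketch below is a minimal illustration, not a reference implementation. The function name `run_epsilon_greedy`, the win probabilities, and the default parameter values are all assumptions chosen for the example.

```python
import random


def run_epsilon_greedy(win_probs, steps=5000, epsilon=0.1, seed=0):
    """Play a Bernoulli k-armed bandit with an epsilon-greedy agent.

    win_probs holds the true win probability of each lever. It is hidden
    from the agent, which only ever sees the rewards it receives.
    """
    rng = random.Random(seed)
    k = len(win_probs)
    counts = [0] * k        # how often each lever has been pulled
    estimates = [0.0] * k   # running average reward of each lever
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore: pick a lever at random
        else:
            # exploit: pick the lever with the best estimate so far
            arm = max(range(k), key=lambda a: estimates[a])

        reward = 1.0 if rng.random() < win_probs[arm] else 0.0

        # Incremental update of the sample-average estimate for this lever.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return estimates, counts, total_reward
```

For example, `run_epsilon_greedy([0.2, 0.5, 0.8])` simulates three levers. Setting `epsilon=0` removes exploration entirely, which makes it easy to see how a purely greedy agent can lock onto whichever lever happened to pay out first.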

Alongside an exploration strategy, it is still necessary to choose the criterion that makes one action optimal compared to another. The study of this question has led to several methods, from brute-force search to methods that take into account temporal differences in the received reward. Despite this, and despite the great results obtained by reinforcement learning methods on small problems, these methods suffer from a lack of scalability and have difficulty solving larger, close-to-human scenarios.
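As one concrete instance of a temporal-difference method, the sketch below uses the TD(0) update, which nudges the value estimate of a state toward the received reward plus the discounted estimate of the next state. The article does not single out a specific algorithm, so choosing TD(0) is an assumption made for illustration. The names `td0_update` and `evaluate_chain` are hypothetical, as are the deterministic chain environment and the values of `alpha` and `gamma`.

```python
def td0_update(values, state, reward, next_state, done, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to values[state] and return the TD error."""
    # Bootstrapped target: reward plus the discounted value of the next state.
    target = reward + (0.0 if done else gamma * values[next_state])
    td_error = target - values[state]
    values[state] += alpha * td_error
    return td_error


def evaluate_chain(n_states=4, episodes=2000, alpha=0.1, gamma=0.9):
    """Estimate state values on a toy chain where the agent always moves right.

    The last state is terminal, and entering it yields a reward of 1.
    """
    values = [0.0] * n_states
    terminal = n_states - 1
    for _ in range(episodes):
        state = 0
        while state != terminal:
            next_state = state + 1
            done = next_state == terminal
            reward = 1.0 if done else 0.0
            td0_update(values, state, reward, next_state, done, alpha, gamma)
            state = next_state
    return values
```

On this chain the estimates approach the discounted return from each state: about 1 for the state next to the goal, then 0.9 and 0.81 for the states further back. The value of the terminal state is never updated and stays at 0.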
