This article discusses the difference between on-policy and off-policy methods in reinforcement learning. On-policy methods optimize the target policy using data generated by that same policy, while off-policy methods use data generated by a separate behavior policy. The concepts are illustrated with SARSA and Q-learning.

First, let’s define our subject matter: whether we say on-policy or off-policy, the ‘policy’ in question is the target policy. The key difference between the two lies in the source of the data used to optimize the target policy, specifically, whether that data is generated by the target policy itself.

As Sutton and Barto put it in *Reinforcement Learning: An Introduction*:

“On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data.”

Ideally, we would like to head straight for the optimal policy, i.e., make the target policy optimal as quickly as possible. In practice, however, always choosing the currently best-known action may lead to convergence to a local optimum, while adding exploration slows learning down; this tension is a drawback of on-policy methods, where a single policy must both explore and be optimized. To balance exploration and exploitation, on-policy methods typically use an epsilon-greedy strategy (see the sketch below).
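As a concrete illustration, here is a minimal sketch of epsilon-greedy action selection over a tabular Q function; the array shape and the default `epsilon` value are assumptions for illustration, not part of the original article.

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon=0.1):
    """Select an action for `state` from a tabular Q function.

    Q has shape (n_states, n_actions). With probability `epsilon`
    we explore (uniform random action); otherwise we exploit
    (the action with the highest current Q value).
    """
    n_actions = Q.shape[1]
    if np.random.rand() < epsilon:
        return int(np.random.randint(n_actions))  # explore
    return int(np.argmax(Q[state]))               # exploit
```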

Off-policy methods, by contrast, involve two distinct policies. First, a behavior policy generates a large amount of interaction data under some probability distribution, with exploration as its main purpose. Then, from this data, which is ‘off’ (deviates from) the policy being optimized, we learn the target policy.

Therefore, on-policy methods update the target policy with data that depends on the target policy itself, hence the name “same policy”. Off-policy methods, on the other hand, update the target policy with data whose generation is independent of it, hence “different policy”.

Typically, this difference manifests when calculating the \(Q(s, a)\) value, where:

  • In on-policy, the next action \(a'\) is generated by the target policy itself.

  • In off-policy, the next action \(a'\) can be generated by a policy other than the target policy, such as the behavior policy (see the sketch after this list).
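A minimal sketch of this distinction in code, assuming a tabular Q array of shape (n_states, n_actions) and reusing the illustrative `epsilon_greedy_action` helper sketched above; the function names and defaults are my own, not from the original article.

```python
import numpy as np

def td_target_on_policy(Q, reward, next_state, gamma=0.99, epsilon=0.1):
    # On-policy (SARSA-style): a' is sampled from the same epsilon-greedy
    # policy that generates the data, and it is the action executed next.
    next_action = epsilon_greedy_action(Q, next_state, epsilon)
    return reward + gamma * Q[next_state, next_action]

def td_target_off_policy(Q, reward, next_state, gamma=0.99):
    # Off-policy (Q-learning-style): a' is the greedy action under the target
    # policy, regardless of which action the behavior policy actually takes next.
    return reward + gamma * np.max(Q[next_state])
```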

We can better understand this through the update methods of SARSA and Q-learning:

  • When SARSA computes its update target, \(a'\) is sampled from the current target policy (the same epsilon-greedy policy that generates the data), and it is also the action actually executed at the next step.

  • When Q-learning computes its update target, \(a'\) ranges over all actions: it looks up \(Q(s', a')\) for every action available in \(s'\) and selects the maximum value among them, regardless of which action the behavior policy will actually take next (see the equations below).
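Written out in their standard tabular forms, with learning rate \(\alpha\) and discount factor \(\gamma\), the two update rules differ only in how the bootstrap term is formed:

SARSA (on-policy):

\[ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \, Q(s', a') - Q(s, a) \right] \]

where \(a'\) is the action actually selected in \(s'\) by the current (epsilon-greedy) target policy.

Q-learning (off-policy):

\[ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \]

where the maximum is taken over all actions \(a'\) available in \(s'\), independently of the action the behavior policy executes next.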