What is the difference between Q-learning and SARSA?

Yes, the on-policy versus off-policy distinction is the only difference. On-policy SARSA learns action values for the policy it actually follows, while off-policy Q-Learning learns them for the greedy policy, regardless of the policy used to collect experience. Under some common conditions (for example, every state-action pair is visited often enough and the learning rate decays appropriately), both converge to the true value function, but at different rates. Q-Learning tends to converge a little more slowly, but it can keep learning while the behavior policy changes. Also, Q-Learning is not guaranteed to converge when combined with linear function approximation.
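To make the distinction concrete, here is a minimal tabular sketch of one update step for each method. The function names, the alpha (learning rate) and gamma (discount) parameters, and the NumPy table layout are illustrative assumptions, not something the answer above specifies.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy step: bootstrap from the greedy action in s_next."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy step: bootstrap from a_next, the action the agent actually takes."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

The only line that differs is the target: Q-Learning takes the maximum over next actions, while SARSA needs the extra argument a_next supplied by the policy being followed.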

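In practical terms, under the ε-greedy policy, Q-Learning computes the difference between Q(s,a) and the maximum action value in the next state, while SARSA computes the difference between Q(s,a) and the value of the next action it actually takes. In expectation, that value is a weighted sum of the average action value and the maximum (the reward and discount factor are omitted here for brevity):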

Q-Learning: Q(s_{t+1}, a_{t+1}) = max_a Q(s_{t+1}, a)

SARSA: E[Q(s_{t+1}, a_{t+1})] = ε·mean_a Q(s_{t+1}, a) + (1-ε)·max_a Q(s_{t+1}, a)
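The weighted-sum form can be checked numerically. The sketch below assumes an ε-greedy policy that, with probability ε, picks uniformly among all actions (the greedy one included) and otherwise picks the greedy action; the helper names and the sample values are hypothetical.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities under epsilon-greedy with uniform exploration."""
    n = len(q_values)
    probs = np.full(n, epsilon / n)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

def expected_sarsa_bootstrap(q_values, epsilon):
    """Expected value of Q(s', a') when a' is drawn from the epsilon-greedy policy."""
    return float(np.dot(epsilon_greedy_probs(q_values, epsilon), q_values))

def weighted_sum_bootstrap(q_values, epsilon):
    """The closed form: eps * mean + (1 - eps) * max."""
    return float(epsilon * np.mean(q_values) + (1.0 - epsilon) * np.max(q_values))

def q_learning_bootstrap(q_values):
    """Q-Learning ignores the behavior policy and uses the maximum."""
    return float(np.max(q_values))

q_next = np.array([1.0, 2.0, 5.0])  # made-up action values for the next state
eps = 0.1
print(expected_sarsa_bootstrap(q_next, eps), weighted_sum_bootstrap(q_next, eps))
print(q_learning_bootstrap(q_next))
```

The first two printed numbers agree, and both sit slightly below the Q-Learning value because the exploratory actions pull the expectation down. A single SARSA update uses one sampled action rather than this expectation, so the identity only holds on average.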
