# Importance Sampling and Off-policy Reinforcement Learning Methods

Published:

In this article, I want to give some of my opinions on the off-policy methods in RL. Basically, what I want to discuss is the importance sampling and its relation to off-policy methods.

One word conclusion: whether we need importance sampling is decided by the distribution of updates.

## A Brief Recap on Off-policy Methods

Briefly, off-policy methods aim to evaluate/optimise a target policy $\pi$ based on the experience generated by behavior policy $b$.

Also, a requirement is that $\pi(a|s)>0 \rightarrow b(a|s)>0$.

## 1st Case: Importance Sampling in Off-policy Monte Carlo Methods

### 1.1. Recap MC Methods

The simplest case of applying importance sampling should be in Off-policy Monte Carlo (MC) methods. From my perspective, the core idea of MC methods is to take the returns of a state $s$ in all samples of episodes as they are sampled in a distribution of returns obtained by performing policy $\pi$ and then take the MC estimate as $v_\pi(s)$. Here, let’s consider on-policy MC methods first and let’s denote episodes as ${e_1, e_2, \dots, e_i, \dots, e_n}$ and denote the return for a state $S_t =s$ in $e_i$ as $G_t^i$ where $t$ is a time-step, then the MC estimate of $v_\pi(s)$ can be given as follow:

$p(G_t^i|e_i)$ is not a natual notation. What I want to express is actually “probabilities of snippets starting from some state $s$ and thus raise return $G_t$”. I may update the notation here later.

### 1.2. Derivation of Off-policy MC Methods

As we want to estimate the values under policy $\pi$, the $p(G_t^i)$ should follow the distribution of $\pi$, which means that: $p_\pi(G_t^i) = \prod_{k=t}^{T_i - 1} \pi(A_k|S_k)p(S_{k+1}|S_k, A_k)$, where $T_i$ is the length of $e_i$.

Ok, now, we have a big trouble: $p(S_{k+1}|S_k, A_k)$ is of course unknown. Otherwise, why do we need to do MC? We can use Dynamic Programming instead. However, fortunately, we don’t need to worry about this. Let’s first replace $\pi$ with another policy $b$, we can write $p_b(G_t^i)$ as follow:

From the above two equations, we can see that for a same snippet $S_t, S_{t+1}, \dots S_{T}$ of some episode $e_i$, its return is always $G_t^i$. However, under diffrent policies, the coresponding probabilities are quite different. Therefore, in off-policy MC methods, we of course need to adjust $p(G_t^i|e_i)$. It’s easy to give the following derivation:

Having this result, we can easily transform $v_b(s)=\mathbb{E}[G_t|s]$ to $v_\pi=\mathbb{E}[\rho_{t:T-1}G_t|S_t=s]$.

Some more details is given in page 103-105 in Richard Sutton’s book. The basic idea is how to count the trajectories that generate $G_t$.

### 1.3 Importance of Importance Sampling

Let’s have a review on why do we need importance sampling. A short answer is given as follow.

Whichever method of reinforcement learning, except dynamic programming, we always need to estimate values of states or state-action pairs. To be more specific, we always want MC estimates of $v_\pi(s)$ or $q_\pi(s,a)$. Hence, we need to sample episodes. Of course, we actually only care about the trajectories that can generate $G_t$. However, in off-policy methods, the probabilities of these trajectories under different policies are different. We thus need to use importance sampling ratio to adjust these probabilities.

To be concise, as far as we sample some episodes and use those “real” returns $G_t$ in these episodes to estimate values, we would need importance sampling to reshape the distribution of $G_t$.

## 2nd Case: Why there isn’t Importance Sampling in one-step Off-policy Temporal Difference Methods

Previously, we showed the relationship between importance sampling and off-policy MC methods. Of course, a natural and further question would be, what’s the relationship between importance sampling and off-policy TD methods? Well, it depends.

As we mentioned above, importance sampling is necessary if we use “real” returns to estimate values of states. Therefore, for off-policy TD(0) method, a.k.a. Q-learning, importance sampling is of course unnecessary, as the next action is choosed based on target policy $\pi$. Consider the Q-learning algorithm shown in Richad Sutton’s book: Although the current action $A$ is choosed based on behavior policy, i.e. “policy derived from $Q$”, the next action is choosed based on target policy, i.e. a greedy policy. Clearly, the sampled tuple we use to update our estimate is generated by performing target policy instead of behavior policy. So, it is of course unnecessary to use importance sampling to reshape it.

## 3rd Case: Importance Sampling in n-step Off-policy TD Methods

When it comes to n-step situations of off-policy TD methods, i.e. off-policy TD(n), we would again need the importance sampling. Why? Although TD methods are bootstrapping themselved, we here again use the sampled n-steps length trajectories and thus have to reshape the distribution of these trajectories. As before, the reshape weight, a.k.a. importance sampling ratio, is still:

With these weights, we can easily generalise n-step Sarsa algorithm to off-policy form:

But, wait a second, why is here n-step Sarsa? Previously, in one-step situation, we used Q-learning. Why do we use Sarsa instead in TD(n) off-policy situations?

Well, considering that in one-step situation, we only use a single tuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ to update q-values, it is of course convenient and efficient to sample a single action based on target policy.

In n-step situations, however, what we have are sampled trajectories $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, \dots, R_{t+n}, S_{t+n}, A_{t+n})$. And, note that the objective of off-policy methods is to utilise these sampled trajectories. Therefore, if we re-select $A_t, \dots, A_{t+n}$, then we’re actually using on-policy methods! Thus, if all these trajectories are generated by behavior policy, we again need to re-shape their distribution.

## 4th Case: n-step Off-policy TD methods without Importance Sampling

In that last section, we saw the importance sampling in off-policy TD(n) methods. A crazy question would be: is it possible to do n-step off-policy methods without importance sampling? The answer is, yes, we can! Am I lost my mind? No, let we see how this could happen.

In previous seciton, we update our estimate based on trajectories $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, \dots, R_{t+n}, S_{t+n}, A_{t+n})$. Notice that in every time-step, there is only one action selected. But, models are bootstrapping themselves, why can’t they estimate q-values for other actions of $S_t$? Yes, they of course can. Recall that, in Expected Sarsa algorithm, the q-values are updated as follow:

Then, with some sampled trajectories $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, \dots, R_{t+n}, S_{t+n}, A_{t+n})$, we already know rewards of a specific action for all time-steps. If we mix these known rewards into Expected Sarsa, do we still need importance sampling? Well, fortunately, not any more! We first give the overall update equations as follows:

The above algorithm is called “n-step Tree Backup Algorithm” (n-step TBA).

In Expected Sarsa, we exploit all actions related to states $(S_t, S_{t+1}, \dots, S_{t+n})$ and thus we are not sampling any more. It’s the same in n-step TBA, as we are just making some estimates more accurate with some “real” rewards generated by performing behavior policy. However, the distribution of $G_{t:t+n}$ is still decided by target policy $\pi$ instead of $b$ and we don’t need importance sampling any more. The key idea here is to improve some updates by utilising the real rewards generated by behavior policy but the whole distribution of updates are still follow target policy.

## 5th Case: A Unified Point of View

In one word, whether we need importance sampling is decided by the distribution of updates. What we care is whehter the distribution of updates follows target policy. If so, we don’t need importance sampling; if not, we of course need.

Some pictures needed here.