1 minute read


In this article, I give a brief summary of the so called “REINFORCE” trick.


Assume that we have the following items:

  1. a random variable $x\sim p(x;\theta)$ where $p$ is a parametric distribution modelled by $\theta$;

  2. a function of $x$ which is denoted as $f(x)$.

The quantity we wish to calculate is:

For example, $x$ and $f(x)$ could be:

  1. trajectories and reward function in Reinforcement Learning;
  2. variational distribution and log-likelihood of joint distribution of observable and latent variables.

Core Transformation


The problem of our objective quantity ($\nabla_\theta \mathbb{E}_{p(x;\theta)}\left[ f(x) \right]$) is that the parameter $\theta$ that need to be optimised doesn’t exist in $f(x)$, thus we can’t get derivative of it with samples drawn from $p(x;\theta)$.

Luckily, we could introduce the following REINFORCE trick to address the problem:

$\nabla_\theta \mathbb{E}_{p(x;\theta)}\left[f(x) \right] = \nabla_\theta \int f(x)p(x;\theta)dx$

$ = \int f(x) \nabla_\theta p(x;\theta) dx $

$ = \int f(x) p(x;\theta) \nabla_\theta \log p(x;\theta) dx $

$= \mathbb{E}_{p(x;\theta)} \left[ f(x) \nabla_\theta \log p(x;\theta) \right]$

Then, with samples from $p(x; \theta)$, we could approximate the expectation by Monte Carlo method, i.e.

Now, we could update $\theta$ by calculating $\nabla_\theta \log p(x_i; \theta)$ based on the samples $x_i$.

Leave a Comment

Your email address will not be published. Required fields are marked *