Importance Sampling over Discrete Structures
March 13th 2020

Importance sampling offers a highly general method for performing inference tasks over arbitrary distributions. Its validity in $\mathbb{R}^n$ is intuitive, but when the latent space mixes discrete structural choices, each with its own set of continuous parameters, we may question whether the importance-weighted approximation over these different measures remains valid.

Let's define a generative model under which we sample a discrete structure $k \sample p(\cdot)$ and a corresponding set of continuous parameters $\bm{\theta} \sample p(\cdot | k)$ of length $N_k$, from which we generate an observation $\mathcal{D} \sample p(\cdot | k, \bm{\theta})$. The full joint distribution is then

$$ p(k,\bm{\theta},\mathcal{D}) = p(k)\, p(\bm{\theta} | k)\, p(\mathcal{D} | k, \bm{\theta}) $$
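To make this concrete, here is a minimal sketch of such a generative model in Python. The specific choices (two structures, Gaussian parameters, a scalar Gaussian observation whose mean is the sum of the parameters) are illustrative assumptions, not anything prescribed by the setup above.

```python
# A toy instance of the generative model p(k) p(theta | k) p(D | k, theta).
# All distributional choices here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

N_K = np.array([1, 2])           # N_k: dimension of theta under structure k
PRIOR_K = np.array([0.5, 0.5])   # p(k)

def sample_joint():
    """Draw (k, theta, D) ~ p(k) p(theta | k) p(D | k, theta)."""
    k = rng.choice(len(N_K), p=PRIOR_K)     # discrete structure
    theta = rng.standard_normal(N_K[k])     # theta | k ~ N(0, I_{N_k})
    data = rng.normal(theta.sum(), 1.0)     # D | k, theta ~ N(sum(theta), 1)
    return k, theta, data
```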

We may wish to recover the posterior, or compute expectations of test functions of the latent parameters, but because of the intractability of the marginal

$$ p(\mathcal{D}) = \sum_k p(k) \int_\Theta p(\bm{\theta}|k)\, p(\mathcal{D}|k,\bm{\theta})\ d\bm{\theta} $$

it's not feasible to do so directly. We might then choose to use importance sampling to form an estimate of the marginal. We sample a set of $N$ latent parameters $\brace{(\bm{\theta}_i, k_i)}_{i=1}^{N} \sample q(\cdot,\cdot)$ from some proposal distribution $q$, and assign each a corresponding weight

$$ w(k, \bm{\theta}) = \frac{p(k)\, p(\bm{\theta}|k)\, p(\mathcal{D}|k,\bm{\theta})}{q(k, \bm{\theta})} $$
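Continuing the toy example, here is a sketch of one way to compute these weights in log space. The proposal $q$ is an assumption made purely for illustration: it factorizes as $q(k)\, q(\bm{\theta} | k)$, with a uniform choice over structures and a deliberately wide Gaussian over parameters.

```python
# Importance weights w(k, theta) = p(k, theta, D) / q(k, theta) for the toy
# model above. The proposal q is an illustrative assumption.
from scipy.stats import norm

Q_K = np.array([0.5, 0.5])   # q(k): uniform over structures
Q_SCALE = 2.0                # std of the Gaussian proposal over theta

def sample_proposal():
    """Draw (k, theta) ~ q(., .)."""
    k = rng.choice(len(N_K), p=Q_K)
    theta = Q_SCALE * rng.standard_normal(N_K[k])
    return k, theta

def log_weight(k, theta, data):
    """log w(k, theta) = log p(k, theta, D) - log q(k, theta)."""
    log_joint = (np.log(PRIOR_K[k])
                 + norm.logpdf(theta, 0.0, 1.0).sum()      # p(theta | k)
                 + norm.logpdf(data, theta.sum(), 1.0))    # p(D | k, theta)
    log_q = np.log(Q_K[k]) + norm.logpdf(theta, 0.0, Q_SCALE).sum()
    return log_joint - log_q
```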

Theorem 1. The expected value of the importance weights is the marginal likelihood.

Proof. It suffices to show

$$ \E_q\bracket{w(k, \bm{\theta})} = p(\mathcal{D}) $$
$$
\begin{aligned}
\sum_k \int_\Theta w(k, \bm{\theta})\, q(k, \bm{\theta})\ d\bm{\theta}
&= \sum_k \int_\Theta \frac{p(k)\, p(\bm{\theta}|k)\, p(\mathcal{D}|k,\bm{\theta})}{q(k, \bm{\theta})}\, q(k, \bm{\theta})\ d\bm{\theta} \\
&= \sum_k p(k) \int_\Theta p(\bm{\theta}|k)\, p(\mathcal{D}|k,\bm{\theta})\ d\bm{\theta} \\
&= p(\mathcal{D})
\end{aligned}
$$

$\Box$
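On the toy example, Theorem 1 amounts to averaging the weights over proposal draws. A quick sketch, reusing the hypothetical `sample_joint`, `sample_proposal`, and `log_weight` defined earlier:

```python
# Monte Carlo estimate of the marginal likelihood p(D) as the average weight,
# illustrating Theorem 1 on the toy model above.
_, _, data = sample_joint()                            # one observed dataset
samples = [sample_proposal() for _ in range(10_000)]
log_w = np.array([log_weight(k, theta, data) for k, theta in samples])
marginal_estimate = np.exp(log_w).mean()               # E_q[w] ~ p(D)
print(f"estimated p(D): {marginal_estimate:.4f}")
```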

Given a consistent estimator of the marginal likelihood, we can also show that the weights allow us to form a consistent estimator of the expectation of any test function under the posterior distribution. To do this, we introduce a quantity called the self-normalized importance sampling estimator.

$$ \tilde{\mu}_f = \frac{\E_q\bracket{w(k, \bm{\theta})\, f(k, \bm{\theta})}}{\E_q\bracket{w(k, \bm{\theta})}} $$

Theorem 2. The self-normalized importance sampling estimator is a consistent estimator of the expected value of the test function under the posterior.

Proof. The denominator equals $p(\mathcal{D})$ by Theorem 1, so it suffices to show

$$ \E_q\bracket{w(k, \bm{\theta})\, f(k, \bm{\theta})} = p(\mathcal{D}) \cdot \E_{(k, \bm{\theta}) \sample p(\cdot, \cdot | \mathcal{D})}\bracket{f(k, \bm{\theta})} $$
$$
\begin{aligned}
\sum_k \int_\Theta w(k, \bm{\theta})\, q(k, \bm{\theta})\, f(k, \bm{\theta})\ d\bm{\theta}
&= \sum_k \int_\Theta \frac{p(k)\, p(\bm{\theta}|k)\, p(\mathcal{D}|k,\bm{\theta})}{q(k, \bm{\theta})}\, q(k, \bm{\theta})\, f(k, \bm{\theta})\ d\bm{\theta} \\
&= \sum_k \int_\Theta p(k, \bm{\theta}, \mathcal{D})\, f(k, \bm{\theta})\ d\bm{\theta} \\
&= p(\mathcal{D}) \sum_k \int_\Theta p(k, \bm{\theta} | \mathcal{D})\, f(k, \bm{\theta})\ d\bm{\theta}
\end{aligned}
$$

Consistency then follows by replacing the numerator and denominator with their Monte Carlo averages and applying the law of large numbers to each separately.

$\Box$
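In code, the self-normalized estimator is just a weighted average. A sketch on the toy example, reusing the draws from the previous snippet with the arbitrary test function $f(k, \bm{\theta}) = \sum_j \theta_j$:

```python
# Self-normalized importance sampling estimate of E_{p(k, theta | D)}[f],
# reusing log_w and samples from the previous snippet. The test function
# f(k, theta) = sum(theta) is an arbitrary illustrative choice.
w = np.exp(log_w - log_w.max())             # rescale for numerical stability
f = np.array([theta.sum() for _, theta in samples])
posterior_mean_f = (w * f).sum() / w.sum()  # the rescaling constant cancels
print(f"posterior mean of sum(theta): {posterior_mean_f:.4f}")
```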

We were able to avoid a measure-theoretic analysis by exploiting the factorization of the distribution. As long as the continuous parameters are defined conditionally on a discrete structure choice, all we need to do is consider the expectations of the relevant distributions. We can be confident that, given enough samples, the importance-weighted estimates will converge to the true posterior expectations.