Finding the Posterior of Latent Variables \(\mathcal{Z}\)

By: Chengyi (Jeff) Chen


Exact Posterior


Approximate Posterior

Searching for the ELBO

Using ideas from importance sampling, assume we have another variational distribution [an approximate posterior distribution to \(p({\mathcal{Z}} \mid {\mathcal{X}} ; \Theta = \theta)\)], \(q(\mathcal{Z} ; \Phi = \phi)\), where \(q(\mathcal{Z} ; \Phi = \phi) > 0\) whenever \(p({\mathcal{Z}}) = \int_{x \in \mathcal{X}} p({\mathcal{X}} = x, {\mathcal{Z}} \mid {\bf \theta})\,\mathrm{d}x > 0\), and we rewrite:

(24)\[\begin{align} \log p(\mathcal{X} \mid \boldsymbol{\theta }) &= \log \sum_{z \in \mathcal{Z}} p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }}) \frac{q({\mathcal{Z} = z} \mid {\bf \phi})}{q({\mathcal{Z} = z} \mid {\bf \phi})} \\ &= \log \operatorname {E}_{q({\mathcal{Z} = z} \mid {\bf \phi})} \left[\frac{p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }})}{q({\mathcal{Z} = z} \mid {\bf \phi})} \right] \\ \end{align}\]

By Jensen’s Inequality, given a concave function \(f\) (e.g. \(\log\)), \(f\left(\operatorname {E}\left[X\right]\right) \geq \operatorname {E}\left[f(X)\right]\) [Variatio28:online]:

(25)\[\begin{align} \log p(\mathcal{X} \mid \boldsymbol{\theta }) &= \log \operatorname {E}_{q({\mathcal{Z} = z} \mid {\bf \phi})} \left[\frac{p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }})}{q({\mathcal{Z} = z} \mid {\bf \phi})} \right] \\ &\geq \operatorname {E}_{q({\mathcal{Z} = z} \mid {\bf \phi})} \left[\log\left(\frac{p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }})}{q({\mathcal{Z} = z} \mid {\bf \phi})}\right)\right] \\ &= \operatorname {E}_{q({\mathcal{Z} = z} \mid {\bf \phi})} \left[\log p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }}) - \log q({\mathcal{Z} = z} \mid {\bf \phi})\right] \\ &= \operatorname {E}_{q({\mathcal{Z} = z} \mid {\bf \phi})} \left[\log p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }})\right] - \operatorname {E}_{q({\mathcal{Z} = z} \mid {\bf \phi})} \left[\log q({\mathcal{Z} = z} \mid {\bf \phi})\right] \\ &= \underbrace{\underbrace{\operatorname {E}_{q({\mathcal{Z} = z} \mid {\bf \phi})} \left[\log p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }})\right]}_{\text{Expected Complete-data Log Likelihood}} + \underbrace{\operatorname{H}\left[q({\mathcal{Z}} \mid {\bf \phi})\right]}_{\text{Entropy of Variational Dist.}}}_{\text{ELBO / Negative Variational Free Energy } \mathcal{L}(q({\mathcal{Z}}\mid {\bf \phi}))} \\ \end{align}\]
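As a quick numerical sanity check of the bound above (not part of the derivation itself), here is a minimal NumPy sketch showing that \(\log \operatorname{E}[X] \geq \operatorname{E}[\log X]\) for a positive random variable; the log-normal choice below is just an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Samples of an arbitrary positive random variable X (log-normal is an
# illustrative choice, not something prescribed by the derivation above).
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

log_of_mean = np.log(x.mean())   # log E[X]
mean_of_log = np.log(x).mean()   # E[log X]

# Jensen's inequality for the concave log: log E[X] >= E[log X].
print(f"log E[X] = {log_of_mean:.4f} >= E[log X] = {mean_of_log:.4f}")
assert log_of_mean >= mean_of_log
```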

Hence, we get an Evidence Lower Bound (ELBO), also known as the Negative Variational Free Energy, on the \(\log\) Evidence. Instead of an inequality, we can obtain an exact equality of the form below by rearranging the KL Divergence from our variational distribution (approximate posterior over latent variables) \(q({\mathcal{Z}} \mid {\bf \phi})\) to our actual posterior over latent variables \(p({\mathcal{Z}} \mid {\mathcal{X}}, {\bf \theta})\):

Derivation from \({\rm KL}(q({\mathcal{Z}} \mid {\bf \phi}) \mid\mid p({\mathcal{Z}} \mid {\mathcal{X}}, {\bf \theta}))\):

(26)\[\begin{align} {\rm KL}(q({\mathcal{Z}} \mid {\bf \phi}) \mid\mid p({\mathcal{Z}} \mid {\mathcal{X}}, {\bf \theta})) &= \operatorname{E}_{q({\mathcal{Z} = z} \mid {\bf \phi})}\left[\log\left(\frac{q({\mathcal{Z} = z} \mid {\bf \phi})}{p({\mathcal{Z} = z} \mid {\mathcal{X}}, {\bf \theta})}\right)\right] \\ &= \operatorname{E}_{q({\mathcal{Z} = z} \mid {\bf \phi})}\left[\log q({\mathcal{Z} = z} \mid {\bf \phi})\right] - \operatorname{E}_{q({\mathcal{Z} = z} \mid {\bf \phi})}\left[\log p({\mathcal{Z} = z} \mid {\mathcal{X}}, {\bf \theta})\right] \\ &= \operatorname{E}_{q({\mathcal{Z} = z} \mid {\bf \phi})}\left[\log q({\mathcal{Z} = z} \mid {\bf \phi})\right] - \operatorname{E}_{q({\mathcal{Z} = z} \mid {\bf \phi})}\left[\log p({\mathcal{Z} = z}, {\mathcal{X}} \mid {\bf \theta})\right] + \operatorname{E}_{q({\mathcal{Z} = z} \mid {\bf \phi})}\left[\log p({\mathcal{X}} \mid {\bf \theta})\right] \\ &= -\left[\operatorname {E}_{q({\mathcal{Z} = z} \mid {\bf \phi})} \left[\log p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }})\right] + \operatorname{H}\left[q({\mathcal{Z}} \mid {\bf \phi})\right]\right] + \operatorname{E}_{q({\mathcal{Z} = z} \mid {\bf \phi})}\left[\log p({\mathcal{X}} \mid {\bf \theta})\right] \\ &= -\mathcal{L}(q({\mathcal{Z}}\mid {\bf \phi})) + \log p({\mathcal{X}} \mid {\bf \theta}) \quad \because \log p({\mathcal{X}} \mid {\bf \theta})\text{ is constant with respect to }{\mathcal{Z}}\text{, so the expectation over }q\text{ leaves it unchanged} \\ \end{align}\]
(27)\[\begin{align} \therefore \log p({\mathcal{X}} \mid {\bf \theta}) &= \mathcal{L}(q({\mathcal{Z}} \mid {\bf \phi})) + {\rm KL}(q({\mathcal{Z}} \mid {\bf \phi}) \mid\mid p({\mathcal{Z}} \mid {\mathcal{X}}, {\bf \theta})) \\ \end{align}\]
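To see the decomposition (27) concretely, here is a minimal NumPy sketch with a single observation, a made-up binary latent variable, and an arbitrary \(q\); all the numbers are assumptions for illustration:

```python
import numpy as np

# Hypothetical joint p(x, z | theta) for a single observed x and a binary
# latent z, written directly as the vector [p(x, z=0), p(x, z=1)].
p_x_z = np.array([0.10, 0.30])          # assumed values; sums to p(x) = 0.40

# Exact quantities.
p_x = p_x_z.sum()                        # evidence p(x | theta)
p_z_given_x = p_x_z / p_x                # exact posterior p(z | x, theta)

# An arbitrary variational distribution q(z | phi).
q = np.array([0.6, 0.4])

# ELBO = E_q[log p(x, z | theta)] + H[q].
elbo = np.sum(q * np.log(p_x_z)) - np.sum(q * np.log(q))

# KL(q(z | phi) || p(z | x, theta)).
kl = np.sum(q * np.log(q / p_z_given_x))

# Verify the decomposition log p(x | theta) = ELBO + KL.
print(f"log p(x)  = {np.log(p_x):.6f}")
print(f"ELBO + KL = {elbo + kl:.6f}")
assert np.isclose(np.log(p_x), elbo + kl)
```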

Since \(\log p({\mathcal{X}} \mid {\bf \theta})\) is constant with respect to the variational parameters \({\bf \phi}\), maximizing our ELBO / Negative Variational Free Energy is equivalent to minimizing \({\rm KL}(q({\mathcal{Z}} \mid {\bf \phi}) \mid\mid p({\mathcal{Z}} \mid {\mathcal{X}}, {\bf \theta}))\) (which is 0 when \(q({\mathcal{Z}} \mid {\bf \phi}) = p({\mathcal{Z}} \mid {\mathcal{X}}, {\bf \theta})\)), making our variational approximation as close as possible to the actual posterior over latents. After this procedure, our two tasks will look like:

    1. Find the MLE (\({\bf\theta}, {\bf\phi}\) are parameters) / MAP (\({\bf\theta}, {\bf\phi}\) are random variables) estimates of the model parameters \({\bf \theta_{\rm{max}}}, {\bf \phi_{\rm{max}}}\) by maximizing the ELBO:

(28)\[\begin{align} {\bf\theta_{\rm{max}}} &= \underset{\boldsymbol {\theta}}{\operatorname{argmax}} \log p(\mathcal{X} \mid \boldsymbol{\theta }) \\ {\bf\theta_{\rm{max}}}, {\bf\phi_{\rm{max}}} &\approx \underset{{\bf \theta}, {\bf \phi}}{\operatorname{argmax}} \mathcal{L}(q({\mathcal{Z}} \mid {\bf \phi})) \\ &= \underset{{\bf \theta}, {\bf \phi}}{\operatorname{argmax}} \operatorname {E}_{q({\mathcal{Z}} = z \mid {\bf \phi})} \left[\log p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }})\right] + \operatorname{H}\left[q({\mathcal{Z}} \mid {\bf \phi})\right] \\ \end{align}\]

In maximizing the ELBO, the first term, the Expected Complete-data Log Likelihood, encourages the MLE / MAP estimates of the model parameters to explain the observed data well under the latent-variable assignments weighted by \(q({\mathcal{Z}} \mid {\bf \phi})\), while the second term, the entropy of the variational distribution, discourages \(q({\mathcal{Z}} \mid {\bf \phi})\) from collapsing onto a single configuration of the latent variables.

    2. Find the posterior over the latent variables \(\mathcal{Z}\), \(p(\mathcal{Z} \mid \mathcal{X}, \boldsymbol {\theta_{\rm{max}} })\) [SVIPartI61:online]:

(29)\[\begin{align} p(\mathcal{Z} \mid \mathcal{X}, \boldsymbol {\theta_{\rm{max}} }) &\approx q({\mathcal{Z}} \mid {\bf \phi}) \\ \end{align}\]

Finding the ELBO Part 1: Expectation-Maximization

The EM algorithm seeks the maximum likelihood estimate of the model parameters under the evidence / marginal likelihood / incomplete-data likelihood by iteratively applying the following two steps [Expectat45:online] (a concrete sketch follows after the two steps):

    1. Expectation step (E step): Set the approximate posterior / variational distribution \(q({\mathcal{Z}}\mid {\bf \phi}) = p(\mathcal{Z} \mid \mathcal{X}, {\boldsymbol {\theta }}^{(t)})\), where \({\boldsymbol {\theta }}^{(t)}\) are the previous M-step estimates of \(\boldsymbol {\theta }\). This way, \({\rm KL}(q({\mathcal{Z}} \mid {\bf \phi}) \mid\mid p({\mathcal{Z}} \mid {\mathcal{X}}, {\boldsymbol {\theta }}^{(t)})) = 0\) and \(\log p({\mathcal{X}} \mid {\boldsymbol {\theta }}^{(t)}) = \mathcal{L}(p({\mathcal{Z}} \mid {\mathcal{X}}, {\boldsymbol {\theta }}^{(t)}))\). Our objective is then to:

    • A. Calculate the posterior over latent variables \(p(\mathcal{Z} \mid \mathcal{X} ,{\boldsymbol {\theta }}^{(t)})\) and

    • B. Calculate \(Q({\boldsymbol {\theta }}\mid {\boldsymbol {\theta }}^{(t)})\) (Expected Complete-data Log Likelihood):

(30)\[\begin{align} Q({\boldsymbol {\theta }}\mid {\boldsymbol {\theta }}^{(t)}) &= \operatorname {E} _{p(\mathcal{Z} = z \mid \mathcal{X} ,{\boldsymbol {\theta }}^{(t)})}\left[\log L({\boldsymbol {\theta }};\mathcal{X} ,\mathcal{Z} = z )\right]\, \\ &= \operatorname {E} _{p(\mathcal{Z} = z \mid \mathcal{X} ,{\boldsymbol {\theta }}^{(t)})}\left[\log p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }}) \right]\, \\ &= \sum_{z \in \mathcal{Z}} p(\mathcal{Z} = z \mid \mathcal{X} ,{\boldsymbol {\theta }}^{(t)}) \log p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }}) \\ \end{align}\]

Notice that the only thing missing from \(Q({\boldsymbol {\theta }}\mid {\boldsymbol {\theta }}^{(t)})\) compared to the ELBO is the entropy of the approximate posterior distribution, \(\operatorname{H}\left[q({\mathcal{Z}} \mid {\bf \phi})\right]\), which does not depend on \({\boldsymbol {\theta }}\) and therefore does not affect the maximization in the M step.

    2. Maximization step (M step): Find the parameters that maximize \( Q({\boldsymbol {\theta }}\mid {\boldsymbol {\theta }}^{(t)})\):

(31)\[\begin{align} {\boldsymbol {\theta }}^{(t+1)} &= {\underset {\boldsymbol {\theta }}{\operatorname {arg\,max} }}\ Q({\boldsymbol {\theta }}\mid {\boldsymbol {\theta }}^{(t)})\, \end{align}\]
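To make the two steps concrete, below is a minimal NumPy sketch of EM for a two-component univariate Gaussian mixture; the synthetic data, initialization, and number of iterations are assumptions for illustration, not part of the derivation above:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data from an assumed two-component Gaussian mixture.
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

# Initial guesses for theta = (mixing weights, means, variances).
weights = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def log_normal_pdf(x, mu, var):
    """Elementwise log N(x | mu, var)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

for t in range(50):
    # E step: set q(z) = p(z | x, theta^(t)), i.e. the responsibilities.
    log_joint = np.log(weights) + log_normal_pdf(x[:, None], mu, var)  # shape (n, 2)
    log_evidence = np.logaddexp(log_joint[:, 0], log_joint[:, 1])
    resp = np.exp(log_joint - log_evidence[:, None])                   # p(z | x, theta^(t))

    # M step: theta^(t+1) = argmax_theta Q(theta | theta^(t)).
    nk = resp.sum(axis=0)
    weights = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print("mixing weights:", np.round(weights, 3))
print("means:         ", np.round(mu, 3))
print("variances:     ", np.round(var, 3))
```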

Finding the ELBO Part 2: Markov Chain Monte Carlo

Finding the ELBO Part 3: Mean-Field Approximate Variational Inference

Finding the ELBO Part 4: Black-Box Stochastic Variational Inference


Full Bayesian Inference

We’re now ready to discuss how MLE is performed in probabilistic machine learning. The key difference between MLE and MAP estimation in this setting is whether \(\theta\) is treated as a fixed parameter to be optimized or as a latent random variable with a prior placed over it.

Specifically, in Pyro, to get MLE estimates of \(\theta\), simply declare \(\theta\) as a fixed parameter using pyro.param in the model and use an empty guide (variational distribution). To get MAP estimates instead, declare \(\theta\) just like a regular latent random variable via pyro.sample in the model, but in the guide, declare \(\theta\) as being drawn from a Dirac delta distribution.
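As a rough sketch of these two setups (the Gaussian model, prior, data, and optimizer settings below are hypothetical choices, not prescribed by the text), the Pyro code might look like:

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

data = torch.tensor([0.5, 1.2, -0.3, 0.8])   # made-up observations

# --- MLE: theta is a pyro.param in the model; the guide is empty. ---
def model_mle(data):
    theta = pyro.param("theta", torch.tensor(0.0))
    with pyro.plate("data", len(data)):
        pyro.sample("obs", dist.Normal(theta, 1.0), obs=data)

def guide_mle(data):
    pass   # no latent variables left to approximate

# --- MAP: theta is sampled from a prior in the model; the guide places a
# --- Dirac delta on it.
def model_map(data):
    theta = pyro.sample("theta", dist.Normal(0.0, 1.0))   # assumed prior
    with pyro.plate("data", len(data)):
        pyro.sample("obs", dist.Normal(theta, 1.0), obs=data)

def guide_map(data):
    theta_map = pyro.param("theta_map", torch.tensor(0.0))
    pyro.sample("theta", dist.Delta(theta_map))

# Either (model, guide) pair is trained the same way; here we run the MAP one.
svi = SVI(model_map, guide_map, Adam({"lr": 0.05}), loss=Trace_ELBO())
for step in range(1000):
    svi.step(data)
print("MAP estimate of theta:", pyro.param("theta_map").item())
```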