Finding the Posterior of Latent Variables \(\mathcal{Z}\)¶
By: Chengyi (Jeff) Chen
Exact Posterior¶
Approximate Posterior¶
Searching for the ELBO¶
Using ideas from importance sampling, assume we have another variational distribution \(q(\mathcal{Z} ; \Phi = \phi)\) [an approximate posterior distribution to \(p({\mathcal{Z}} \mid {\mathcal{X}} ; \Theta = \theta)\)], where \(q(\mathcal{Z} ; \Phi = \phi) > 0\) whenever \(p({\mathcal{Z}}) = \int_{x \in \mathcal{X}} p({\mathcal{X}} = x, {\mathcal{Z}} \mid {\bf \theta}) \, dx > 0\), and we rewrite the \(\log\) evidence as an expectation with respect to \(q\).
By Jensen’s Inequality, given a concave function \(f\) (e.g. \(\log\)), \(f\left(\operatorname {E}\left[X\right]\right) \geq \operatorname {E}\left[f(X)\right]\) [Variatio28:online]:
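A sketch of these two steps, writing \(\mathcal{L}(\theta, \phi)\) as a convenience label for the resulting lower bound:

\[
\log p(\mathcal{X} \mid \theta) = \log \int q(\mathcal{Z} ; \phi) \, \frac{p(\mathcal{X}, \mathcal{Z} \mid \theta)}{q(\mathcal{Z} ; \phi)} \, d\mathcal{Z} = \log \operatorname{E}_{q(\mathcal{Z} ; \phi)}\left[\frac{p(\mathcal{X}, \mathcal{Z} \mid \theta)}{q(\mathcal{Z} ; \phi)}\right] \geq \operatorname{E}_{q(\mathcal{Z} ; \phi)}\left[\log p(\mathcal{X}, \mathcal{Z} \mid \theta) - \log q(\mathcal{Z} ; \phi)\right] =: \mathcal{L}(\theta, \phi)
\]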
Hence, we get an Evidence Lower Bound (ELBO) (also known as the Negative Variational Free Energy) on the \(\log\) evidence. Instead of an inequality, we can obtain an exact equality by rearranging the KL divergence from our variational distribution (approximate posterior over latent variables) \(q({\mathcal{Z}} \mid {\bf \phi})\) to our actual posterior over latent variables \(p({\mathcal{Z}} \mid {\mathcal{X}}, {\bf \theta})\):
Derivation from \({\rm KL}(q({\mathcal{Z}} \mid {\bf \phi}) \mid\mid p({\mathcal{Z}} \mid {\mathcal{X}}, {\bf \theta}))\):
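A sketch of the rearrangement, again writing \(\mathcal{L}(\theta, \phi)\) for the ELBO:

\[
{\rm KL}(q({\mathcal{Z}} \mid {\bf \phi}) \mid\mid p({\mathcal{Z}} \mid {\mathcal{X}}, {\bf \theta})) = \operatorname{E}_{q}\left[\log q(\mathcal{Z} \mid \phi) - \log p(\mathcal{Z} \mid \mathcal{X}, \theta)\right] = \operatorname{E}_{q}\left[\log q(\mathcal{Z} \mid \phi) - \log p(\mathcal{X}, \mathcal{Z} \mid \theta)\right] + \log p(\mathcal{X} \mid \theta) = \log p(\mathcal{X} \mid \theta) - \mathcal{L}(\theta, \phi),
\]

so that \(\log p(\mathcal{X} \mid \theta) = \mathcal{L}(\theta, \phi) + {\rm KL}(q({\mathcal{Z}} \mid {\bf \phi}) \mid\mid p({\mathcal{Z}} \mid {\mathcal{X}}, {\bf \theta}))\).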
Since \(\log p({\mathcal{X}} \mid {\bf \theta})\) is a constant with respect to \(\bf \phi\), maximizing our ELBO / Negative Variational Free Energy is equivalent to minimizing \({\rm KL}(q({\mathcal{Z}} \mid {\bf \phi}) \mid\mid p({\mathcal{Z}} \mid {\mathcal{X}}, {\bf \theta}))\) (which is 0 when \(q({\mathcal{Z}} \mid {\bf \phi}) = p({\mathcal{Z}} \mid {\mathcal{X}}, {\bf \theta})\)), making our variational approximation as close as possible to the actual posterior over latents. After this procedure, our two tasks will look like the following (sketched in equations after the list):
Find the MLE (when \({\bf\theta}, {\bf\phi}\) are parameters) / MAP (when \({\bf\theta}, {\bf\phi}\) are random variables) estimates \({\bf \theta_{\rm{max}}}, {\bf \phi_{\rm{max}}}\) of the model parameters by maximizing the ELBO:
In maximizing the ELBO, the first term, the Expected Complete-data Log Likelihood, encourages the MLE / MAP estimates of the model parameters to place high probability on the observed data together with the latent-variable assignments favored by \(q({\mathcal{Z}} \mid {\bf \phi})\), while the second term, the entropy of \(q({\mathcal{Z}} \mid {\bf \phi})\), encourages the approximate posterior to stay spread out rather than collapse onto a single configuration.
Find the posterior over the latent variables \(\mathcal{Z}\), \(p(\mathcal{Z} \mid \mathcal{X}, \boldsymbol {\theta_{\rm{max}} })\) [SVIPartI61:online]:
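A sketch of these two tasks in equations:

\[
{\bf \theta_{\rm{max}}}, {\bf \phi_{\rm{max}}} = \operatorname*{arg\,max}_{\theta, \phi} \mathcal{L}(\theta, \phi), \qquad p(\mathcal{Z} \mid \mathcal{X}, \boldsymbol{\theta_{\rm{max}}}) \approx q(\mathcal{Z} \mid \boldsymbol{\phi_{\rm{max}}})
\]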
Finding the ELBO Part 1: Expectation-Maximization¶
The EM algorithm seeks to find the MLE of the model parameters under the evidence / marginal likelihood / incomplete-data likelihood by iteratively applying these two steps [Expectat45:online]:
Expectation step (E step): Set the approximate posterior / variational distribution \(q({\mathcal{Z}}\mid {\bf \phi}) = p(\mathcal{Z} \mid \mathcal{X}, {\boldsymbol {\theta }}^{(t)})\), where \({\boldsymbol {\theta }}^{(t)}\) are the previous M-step estimates of \(\bf \theta\). This way, \({\rm KL}(q({\mathcal{Z}} \mid {\bf \phi}) \mid\mid p({\mathcal{Z}} \mid {\mathcal{X}}, {\boldsymbol {\theta }}^{(t)})) = 0\) and \(\log p({\mathcal{X}} \mid {\boldsymbol {\theta }}^{(t)})\) equals the ELBO evaluated at this choice of \(q\) (the bound is tight). Our objective is then to
A. Calculate the posterior over latent variables \(p(\mathcal{Z} \mid \mathcal{X} ,{\boldsymbol {\theta }}^{(t)})\) and
B. Calculate \(Q({\boldsymbol {\theta }}\mid {\boldsymbol {\theta }}^{(t)})\) (the Expected Complete-data Log Likelihood):
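A sketch of the standard form of this quantity:

\[
Q({\boldsymbol {\theta }}\mid {\boldsymbol {\theta }}^{(t)}) = \operatorname{E}_{\mathcal{Z} \sim p(\mathcal{Z} \mid \mathcal{X}, {\boldsymbol {\theta }}^{(t)})}\left[\log p(\mathcal{X}, \mathcal{Z} \mid {\boldsymbol {\theta }})\right]
\]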
Notice that the only thing that is missing from \(Q({\boldsymbol {\theta }}\mid {\boldsymbol {\theta }}^{(t)})\) compared to the ELBO is the entropy of the approximate posterior distribution, \(\operatorname{H}\left[q({\mathcal{Z}} \mid {\bf \phi})\right] = -\operatorname{E}_{q}\left[\log q({\mathcal{Z}} \mid {\bf \phi})\right]\).
Maximization step (M step): Find the parameters that maximize \( Q({\boldsymbol {\theta }}\mid {\boldsymbol {\theta }}^{(t)})\):
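A sketch of the corresponding update:

\[
{\boldsymbol {\theta }}^{(t+1)} = \operatorname*{arg\,max}_{\boldsymbol {\theta }} Q({\boldsymbol {\theta }}\mid {\boldsymbol {\theta }}^{(t)})
\]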
Finding the ELBO Part 2: Markov Chain Monte Carlo¶
Finding the ELBO Part 3: Mean-Field Approximate Variational Inference¶
Finding the ELBO Part 4: Black-Box Stochastic Variational Inference¶
Full Bayesian Inference¶
We’re now ready to discuss how MLE is performed in probabilistic machine learning. The key difference between MLE / MAP estimation and full Bayesian inference is that the former treats \(\theta\) as a quantity to be point-estimated, while the latter infers a full posterior distribution over \(\theta\).
Specifically, in Pyro, to get MLE estimates of \(\theta\), simply declare \(\theta\) as a fixed parameter using pyro.param in the model and use an empty guide (variational distribution). To get MAP estimates instead, declare \(\theta\) just like a regular latent random variable with pyro.sample in the model, but in the guide, declare \(\theta\) as being drawn from a Dirac delta distribution.
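A minimal sketch of what this looks like in Pyro, assuming a toy Gaussian model with a single unknown mean \(\theta\); the model, the data values, and the optimizer settings below are illustrative choices, not taken from the original post:

```python
# Sketch: MLE vs. MAP for a toy Gaussian-mean model in Pyro (illustrative only).
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

data = torch.tensor([0.5, 1.2, 0.8, 1.0])  # toy observations

# --- MLE: theta is a fixed parameter declared with pyro.param in the model ---
def model_mle(data):
    theta = pyro.param("theta", torch.tensor(0.0))
    with pyro.plate("data", len(data)):
        pyro.sample("obs", dist.Normal(theta, 1.0), obs=data)

def guide_mle(data):
    pass  # empty guide: no latent variables left to approximate

# --- MAP: theta is a latent variable with a prior; the guide is a Dirac delta ---
def model_map(data):
    theta = pyro.sample("theta", dist.Normal(0.0, 1.0))  # prior over theta
    with pyro.plate("data", len(data)):
        pyro.sample("obs", dist.Normal(theta, 1.0), obs=data)

def guide_map(data):
    theta_map = pyro.param("theta_map", torch.tensor(0.0))
    pyro.sample("theta", dist.Delta(theta_map))  # point-mass approximate posterior

# Maximizing the ELBO with SVI then recovers the point estimate
# (swap in model_mle / guide_mle for the MLE version).
svi = SVI(model_map, guide_map, Adam({"lr": 0.01}), loss=Trace_ELBO())
for _ in range(2000):
    svi.step(data)
print(pyro.param("theta_map").item())
```

With the Delta guide, the \(\log q\) term in the ELBO is identically zero, so the objective reduces to the log joint \(\log p(\mathcal{X}, \theta)\) evaluated at the point estimate, which is exactly the MAP objective.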