Frequentist Vs. Bayesian Methods

By: Chengyi (Jeff) Chen


Introduction


Maximum Likelihood Estimation

Derivation 1: KL Divergence

How to find the best \(p(\mathcal{X}, \mathcal{Z} ; \Theta = \theta)\)?

To learn \(p(\mathcal{X}, \mathcal{Z} ; \Theta = \theta)\), we first need a measure of success: how accurately our model captures the real-life true data distribution. Because we can only observe \(\mathcal{X}\), let's define a "distance" measure between our incomplete data likelihood \(p(\mathcal{X} ; \Theta = \theta)\) (rather than the complete data likelihood, which involves the unobservable \(\mathcal{Z}\)) and the true data distribution \(f(\mathcal{X})\). The smaller this "distance", the better our model approximates the true data generating process. A common such measure is the KL Divergence ("distance" in quotes because the KL Divergence is asymmetric, \(D_{KL}(P \vert\vert Q) \not= D_{KL}(Q \vert\vert P)\), and does not satisfy the triangle inequality). \(D_{KL}(f(\mathcal{X}) \vert \vert p(\mathcal{X};\Theta=\theta))\) measures how well \(p\) approximates \(f\):

(17)\[\begin{align} \theta^* &= \arg\underset{\theta \in \Theta}{\min} D_{KL}(f \vert \vert p) \\ &= \arg\underset{\theta \in \Theta}{\min}\int_{\mathbf{x} \in \mathcal{X}} f(\mathcal{X}=\mathbf{x}) \log \frac{f(\mathcal{X}=\mathbf{x})}{p(\mathcal{X}=\mathbf{x} ; \Theta = \theta)} d\mathbf{x} \\ &= \arg\underset{\theta \in \Theta}{\min}\mathbb{E}_{\mathbf{x} \sim f} [\log f(\mathcal{X}=\mathbf{x})] - \mathbb{E}_{\mathbf{x} \sim f} [\log p(\mathcal{X}=\mathbf{x} ; \Theta = \theta)] \\ &= \arg\underset{\theta \in \Theta}{\min}-\mathbb{H}[f(\mathcal{X})] - \mathbb{E}_{\mathbf{x} \sim f} [\log p(\mathcal{X}=\mathbf{x} ; \Theta = \theta)] \\ &= \arg\underset{\theta \in \Theta}{\max} \mathbb{E}_{\mathbf{x} \sim f} [\log p(\mathcal{X}=\mathbf{x} ; \Theta = \theta)] \because \mathbb{H}[f(\mathcal{X})] \text{ is constant in } \theta \\ &\approx \arg\underset{\theta \in \Theta}{\max} \frac{1}{N}\sum_{\mathbf{x}_i \in \mathbf{X}_{\text{train}}} \log p(\mathcal{X}=\mathbf{x}_i ; \Theta = \theta) \because \text{law of large numbers: the empirical average converges to the expectation as } N \rightarrow \infty \\ &= \arg\underset{\theta \in \Theta}{\max} \sum_{\mathbf{x}_i \in \mathbf{X}_{\text{train}}} \log p(\mathcal{X}=\mathbf{x}_i ; \Theta = \theta) \because \text{scaling by } \tfrac{1}{N} > 0 \text{ does not change the } \arg\max \\ &= \arg\underset{\theta \in \Theta}{\max} \prod_{\mathbf{x}_i \in \mathbf{X}_{\text{train}}} p(\mathcal{X}=\mathbf{x}_i ; \Theta = \theta) \because \log\text{ is a monotonically increasing function} \\ &= \arg\underset{\theta \in \Theta}{\max} p(\mathcal{X}=\mathbf{X}_{\text{train}} ; \Theta = \theta) \because \text{i.i.d. data assumption} \\ &= \theta_{\text{MLE}} \end{align}\]

We have thus arrived at Maximum Likelihood Estimation of parameters: a point estimate of the parameters that maximizes the incomplete data likelihood (or the complete data likelihood when the model has no latent variables).
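To make this concrete, here is a minimal sketch of MLE for a univariate Gaussian model (the data generating distribution and all names here are hypothetical): we minimize the average negative log-likelihood from (17) numerically and check the result against the known closed-form answer.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Stand-in for samples drawn from the true data distribution f(X)
rng = np.random.default_rng(0)
X_train = rng.normal(loc=2.0, scale=1.5, size=1000)

def avg_neg_log_likelihood(params, x):
    """Average negative log-likelihood of the model p(X; mu, sigma)."""
    mu, log_sigma = params  # optimize log(sigma) so sigma stays positive
    return -np.mean(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

result = minimize(avg_neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(X_train,))
mu_mle, sigma_mle = result.x[0], np.exp(result.x[1])

# For a Gaussian, the MLE has a closed form: sample mean and (biased) sample std
print(mu_mle, X_train.mean())    # should be nearly identical
print(sigma_mle, X_train.std())  # should be nearly identical
```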

Derivation 2: Posterior with Uniform Prior on Parameters

Why is MLE a “frequentist” inference technique?

This technique is coined a "frequentist" method primarily because it assumes that \(\Theta = \theta\) is a fixed parameter that needs to be estimated. Bayesians instead treat \(\Theta\) as a random variable with a probability distribution \(p(\Theta)\) describing its behavior, called the prior. In probabilistic programming / machine learning, however, we don't have to worry about these conflicting paradigms: to "convert" \(\Theta\) into a random variable, we just move \(\Theta\) into \(\mathcal{Z}\), and as long as we have a way to model \(\mathcal{Z}\), more specifically the posterior distribution of our latent variables \(p(\mathcal{Z} \vert \mathcal{X} ; \Theta = \theta)\), we are good.
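For instance, in a probabilistic programming language like Pyro, "moving \(\Theta\) into \(\mathcal{Z}\)" amounts to declaring it with a pyro.sample statement so that it gets a prior; a minimal sketch, with the model structure and the unit Gaussian prior chosen purely for illustration:

```python
import torch
import pyro
import pyro.distributions as dist

def model(data):
    # Frequentist view: theta would be a fixed argument of the model.
    # Bayesian view: theta is a latent random variable with a prior p(Theta).
    theta = pyro.sample("theta", dist.Normal(0.0, 1.0))
    with pyro.plate("data", len(data)):
        # Likelihood p(X | theta), conditioned on the observed data
        pyro.sample("obs", dist.Normal(theta, 1.0), obs=data)
```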

Can we simply find the \(\theta\) that maximizes \(p(\mathcal{X}=\mathbf{X}_{\text{train}} ; \Theta = \theta)\)?

Unfortunately, because our model is specified with the latent variables \(\mathcal{Z}\), we can’t directly maximize \(p(\mathcal{X}=\mathbf{X}_{\text{train}} ; \Theta = \theta)\). We’ll have to marginalize out the latent variables first as follows:

(18)\[\begin{align} p(\mathcal{X} = \mathbf{X}_{\text{train}} ; \Theta = \theta) &= \int_{\mathbf{z} \in \mathcal{Z}} p(\mathcal{X} = \mathbf{X}_{\text{train}}, \mathcal{Z} = \mathbf{z}; \Theta = \theta) d\mathbf{z} \\ &= \int_{\mathbf{z} \in \mathcal{Z}} p(\mathcal{X} = \mathbf{X}_{\text{train}} \vert \mathcal{Z} = \mathbf{z} ; \Theta = \theta) p(\mathcal{Z} = \mathbf{z} ; \Theta = \theta) d\mathbf{z} \\ \end{align}\]

and hence, Maximum Likelihood Estimation becomes:

(19)\[\begin{align} \theta^* &= \arg\underset{\theta \in \Theta}{\max} \int_{\mathbf{z} \in \mathcal{Z}} p(\mathcal{X} = \mathbf{X}_{\text{train}} \vert \mathcal{Z} = \mathbf{z} ; \Theta = \theta) p(\mathcal{Z} = \mathbf{z} ; \Theta = \theta) d\mathbf{z} \\ \end{align}\]

However, this marginalization is often intractable (e.g. if \(\mathcal{Z}\) is a sequence of events, the number of possible configurations grows exponentially with the sequence length, making exact evaluation of the integral extremely difficult). Let's instead try to find a lower bound for it by expanding it.
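Before that, note that when \(\mathcal{Z}\) is discrete with only a handful of values, the integral in (18) is just a small sum and is perfectly tractable; it is only when \(\mathcal{Z}\) has exponentially many configurations that the computation breaks down. A sketch of the tractable case, using a two-component Gaussian mixture with arbitrarily chosen parameters:

```python
import numpy as np
from scipy.stats import norm

# Arbitrary theta for a 2-component Gaussian mixture: p(Z = z) and p(X | Z = z)
weights = np.array([0.3, 0.7])   # p(Z = z; theta)
means   = np.array([-2.0, 3.0])  # component means of p(X | Z = z; theta)
stds    = np.array([1.0, 0.5])   # component stds of p(X | Z = z; theta)

def incomplete_data_likelihood(x):
    """p(X = x; theta) = sum_z p(X = x | Z = z; theta) p(Z = z; theta)."""
    return np.sum(weights * norm.pdf(x, loc=means, scale=stds))

print(incomplete_data_likelihood(0.0))
```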


Maximum A Posteriori

Derivation 1: Computationally Inconvenient to Calculate the Full Posterior \(p(\Theta \vert \mathcal{X} = \mathbf{X}_{\text{train}})\)

Before continuing, realize that because

(20)\[\begin{align} p(\mathcal{X}, \mathcal{Z} ; \Theta = \theta) &= p(\mathcal{Z} \vert \mathcal{X}; \Theta = \theta) p(\mathcal{X} ; \Theta = \theta) \end{align}\]

the posterior over the latent variables is

(21)\[\begin{align} p(\mathcal{Z} \vert \mathcal{X}; \Theta = \theta) &= \frac{p(\mathcal{X}, \mathcal{Z} ; \Theta = \theta)}{p(\mathcal{X} ; \Theta = \theta)} \end{align}\]
so that the Maximum Likelihood objective can equivalently be written as an average of per-datum log marginal likelihoods:

(22)\[\begin{align} \theta^* &= \arg\underset{\theta \in \Theta}{\max} \frac{1}{N}\sum_{\mathbf{x}_i \in \mathbf{X}_{\text{train}}} \log \int_{\mathbf{z} \in \mathcal{Z}} p(\mathcal{X}=\mathbf{x}_i, \mathcal{Z}=\mathbf{z}; \Theta = \theta) d\mathbf{z} \\ \end{align}\]
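Continuing the mixture sketch from above, (21) is exactly the "responsibility" computation familiar from EM: divide each latent configuration's share of the joint by the marginal likelihood.

```python
import numpy as np
from scipy.stats import norm

weights = np.array([0.3, 0.7])   # p(Z = z; theta)
means   = np.array([-2.0, 3.0])  # component means of p(X | Z = z; theta)
stds    = np.array([1.0, 0.5])   # component stds of p(X | Z = z; theta)

def latent_posterior(x):
    """p(Z = z | X = x; theta) = p(X = x, Z = z; theta) / p(X = x; theta)."""
    joint = weights * norm.pdf(x, loc=means, scale=stds)  # p(x, z) for each z
    return joint / joint.sum()                            # normalize by p(x)

print(latent_posterior(0.0))  # a distribution over z that sums to 1
```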

Note

Mathematical Notation

The math notation of my content, including the ones in this post, follows the conventions in Christopher M. Bishop's Pattern Recognition and Machine Learning. In addition, I use calligraphic capitalized Roman and capitalized Greek symbols like \(\mathcal{X}, \mathcal{Y}, \mathcal{Z}, \Omega, \Psi, \Xi, \ldots\) to represent both a set of values that a random variable can take and the argument of a function in Python (e.g. def p(Θ=θ)).

https://pyro.ai/examples/intro_long.html#Background:-inference,-learning-and-evaluation

Objective:

(23)\[\begin{align} \theta_{\text{MAP}} &= \arg\underset{\theta \in \Theta}{\max} p(\Theta = \theta \vert \mathcal{X} = \mathbf{X}_{\text{train}}) \\ &= \arg\underset{\theta \in \Theta}{\max} \frac{p(\mathcal{X} = \mathbf{X}_{\text{train}} \vert \Theta = \theta) p(\Theta = \theta)}{p(\mathcal{X} = \mathbf{X}_{\text{train}})} \\ &= \arg\underset{\theta \in \Theta}{\max} p(\mathcal{X} = \mathbf{X}_{\text{train}} \vert \Theta = \theta) p(\Theta = \theta) \because p(\mathcal{X} = \mathbf{X}_{\text{train}}) \text{ is constant in } \theta \end{align}\]
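A minimal sketch of (23) for the mean of a Gaussian likelihood under a \(\mathcal{N}(0, 1)\) prior (all numbers arbitrary): the log prior acts as a regularizer added to the log-likelihood, and conjugacy gives a closed form to check against.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
X_train = rng.normal(loc=2.0, scale=1.0, size=20)

def neg_log_posterior(theta):
    # log p(theta | X_train) = log p(X_train | theta) + log p(theta) + const
    log_likelihood = np.sum(norm.logpdf(X_train, loc=theta, scale=1.0))
    log_prior = norm.logpdf(theta, loc=0.0, scale=1.0)  # p(Theta) = N(0, 1)
    return -(log_likelihood + log_prior)

theta_map = minimize_scalar(neg_log_posterior).x

# With a conjugate Gaussian prior, the posterior mode has a closed form:
n = len(X_train)
print(theta_map, X_train.sum() / (n + 1.0))  # should match
```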

Derivation 2:

Parameter Uncertainty

Frequentist: uncertainty about \(\theta\) is expressed with confidence intervals.

Bayesian: uncertainty about \(\theta\) is expressed with credible intervals, i.e. regions of the posterior \(p(\Theta \vert \mathcal{X} = \mathbf{X}_{\text{train}})\) containing a given probability mass (the two are contrasted in the sketch below).
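The two intervals can be numerically similar yet mean different things: the confidence interval is a random interval that would cover the fixed true \(\theta\) in 95% of repeated experiments, while the credible interval is a fixed region holding 95% of the posterior mass. A sketch for a Gaussian mean with known variance and an (arbitrarily chosen) \(\mathcal{N}(0, 1)\) prior:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X_train = rng.normal(loc=2.0, scale=1.0, size=50)
n, xbar = len(X_train), X_train.mean()
z = norm.ppf(0.975)  # two-sided 95%

# Frequentist 95% confidence interval for theta (known sigma = 1)
ci = (xbar - z / np.sqrt(n), xbar + z / np.sqrt(n))

# Bayesian 95% credible interval under a N(0, 1) prior on Theta
post_mean = X_train.sum() / (n + 1.0)
post_std = np.sqrt(1.0 / (n + 1.0))
cred = (post_mean - z * post_std, post_mean + z * post_std)

print(ci, cred)  # close in value here, but the interpretations differ
```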

Prediction Intervals


Empirical Bayes; Type II Maximum Likelihood Estimation


Hierarchical Bayes