Frequentist Vs. Bayesian Methods¶
By: Chengyi (Jeff) Chen
Introduction¶
Maximum Likelihood Estimation¶
Derivation 1: KL Divergence¶
How to find the best \(p(\mathcal{X}, \mathcal{Z} ; \Theta = \theta)\)?¶
To learn \(p(\mathcal{X}, \mathcal{Z} ; \Theta = \theta)\), we first need a measure of success: how accurately our model captures the true data distribution. Because we can only observe \(\mathcal{X}\), let’s define a “distance” between our incomplete-data likelihood \(p(\mathcal{X} ; \Theta = \theta)\) (rather than the complete-data likelihood, since \(\mathcal{Z}\) is unobserved) and the true data distribution \(f(\mathcal{X})\). The smaller the “distance” between the two distributions, the better our model approximates the true data-generating process. A common such measure is the KL Divergence (“distance” only in quotes, because the KL Divergence is asymmetric, \(D_{KL}(P \vert\vert Q) \not= D_{KL}(Q \vert\vert P)\), and does not satisfy the triangle inequality). \(D_{KL}(f(\mathcal{X}) \vert \vert p(\mathcal{X};\Theta=\theta))\) measures how well \(p\) approximates \(f\):
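Expanding the definition (a standard derivation sketch, in the notation above):

$$
\begin{aligned}
D_{KL}(f(\mathcal{X}) \vert\vert p(\mathcal{X};\Theta=\theta))
&= \mathbb{E}_{f(\mathcal{X})}\left[\log \frac{f(\mathcal{X})}{p(\mathcal{X};\Theta=\theta)}\right] \\
&= \underbrace{\mathbb{E}_{f(\mathcal{X})}\left[\log f(\mathcal{X})\right]}_{\text{constant w.r.t. } \theta} - \mathbb{E}_{f(\mathcal{X})}\left[\log p(\mathcal{X};\Theta=\theta)\right],
\end{aligned}
$$

so minimizing the KL Divergence over \(\theta\) is equivalent to maximizing \(\mathbb{E}_{f(\mathcal{X})}\left[\log p(\mathcal{X};\Theta=\theta)\right]\), which we approximate with the sample average of the log-likelihood over \(\mathbf{X}_{\text{train}}\).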
We have thus arrived at Maximum Likelihood Estimation of parameters (you can read more about this derivation method here and here): a point estimate of the parameters that maximizes the incomplete-data likelihood (or the complete-data likelihood when the model has no latent variables).
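As a concrete numerical illustration (a minimal sketch of my own, not from the post; the Gaussian model and the use of `scipy.optimize` are assumptions), MLE amounts to minimizing the negative log-likelihood of the observed data:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=1000)  # observed data, standing in for X_train

def neg_log_likelihood(params, x):
    """Negative log-likelihood of i.i.d. Gaussian data."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)  # parameterize by log(sigma) so sigma stays positive
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (x - mu) ** 2 / sigma**2)

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(X,))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# The numerical optimum matches the closed-form Gaussian MLE:
# mu_hat is close to X.mean(), sigma_hat is close to X.std() (the biased MLE, ddof=0)
```

The same recipe generalizes: write down the log-likelihood of the model, then hand its negation to an optimizer.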
Derivation 2: Posterior with Uniform Prior on Parameters¶
Why is MLE a “frequentist” inference technique?¶
The primary reason this technique is called a “frequentist” method is the assumption that \(\Theta = \theta\) is a fixed parameter that needs to be estimated, while Bayesians believe that \(\Theta\) should be a random variable, and hence have a probability distribution \(p(\Theta)\) describing its behavior, called our prior. In probabilistic programming / machine learning, however, we don’t have to worry about these conflicting paradigms. To “convert” \(\Theta\) into a random variable, we just need to move \(\Theta\) into \(\mathcal{Z}\); as long as we have a way to model \(\mathcal{Z}\), more specifically the posterior distribution of our latent variables \(p(\mathcal{Z} \vert \mathcal{X} ; \Theta = \theta)\), we are good.
Can we simply find the \(\theta\) that maximizes \(p(\mathcal{X}=\mathbf{X}_{\text{train}} ; \Theta = \theta)\)?¶
Unfortunately, because our model is specified with the latent variables \(\mathcal{Z}\), we can’t directly maximize \(p(\mathcal{X}=\mathbf{X}_{\text{train}} ; \Theta = \theta)\). We’ll have to marginalize out the latent variables first as follows:
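Sketching the standard identity in the notation above:

$$
p(\mathcal{X}=\mathbf{X}_{\text{train}} ; \Theta = \theta) = \int p(\mathcal{X}=\mathbf{X}_{\text{train}}, \mathcal{Z} ; \Theta = \theta) \, d\mathcal{Z}
$$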
and hence, Maximum Likelihood Estimation becomes:
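Written out (a reconstructed sketch of the objective):

$$
\theta^{*} = \underset{\theta}{\arg\max}\ \log \int p(\mathcal{X}=\mathbf{X}_{\text{train}}, \mathcal{Z} ; \Theta = \theta) \, d\mathcal{Z}
$$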
However, this marginalization is often intractable (e.g., if \(\mathcal{Z}\) is a sequence of events, the number of possible values grows exponentially with the sequence length, making exact calculation of the integral extremely difficult). Let’s instead try to find a lower bound for it by expanding it.
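One standard route to such a lower bound (sketched here with an auxiliary variational distribution \(q(\mathcal{Z})\), an ingredient assumed rather than taken from the text above) is Jensen’s inequality:

$$
\log \int p(\mathbf{X}_{\text{train}}, \mathcal{Z} ; \theta) \, d\mathcal{Z}
= \log \int q(\mathcal{Z}) \frac{p(\mathbf{X}_{\text{train}}, \mathcal{Z} ; \theta)}{q(\mathcal{Z})} \, d\mathcal{Z}
\geq \int q(\mathcal{Z}) \log \frac{p(\mathbf{X}_{\text{train}}, \mathcal{Z} ; \theta)}{q(\mathcal{Z})} \, d\mathcal{Z},
$$

the right-hand side being the evidence lower bound (ELBO).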
Maximum A Posteriori¶
Derivation 1: Computationally Inconvenient to calculate the full Posterior \(p(\Theta \vert \mathcal{X} = \mathbf{X}_{\text{train}})\)¶
Before continuing, realize that, because of Bayes’ rule, computing the full posterior requires the evidence term \(p(\mathcal{X} = \mathbf{X}_{\text{train}})\), which involves the same intractable marginalization over the latent variables.
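Concretely, Bayes’ rule gives (a standard sketch in the notation above):

$$
p(\Theta \vert \mathcal{X} = \mathbf{X}_{\text{train}}) = \frac{p(\mathcal{X} = \mathbf{X}_{\text{train}} \vert \Theta) \, p(\Theta)}{p(\mathcal{X} = \mathbf{X}_{\text{train}})} \propto p(\mathcal{X} = \mathbf{X}_{\text{train}} \vert \Theta) \, p(\Theta),
$$

so the Maximum A Posteriori estimate \(\theta_{\text{MAP}} = \underset{\theta}{\arg\max}\ p(\mathcal{X} = \mathbf{X}_{\text{train}} \vert \Theta = \theta) \, p(\Theta = \theta)\) can be found without evaluating the evidence \(p(\mathcal{X} = \mathbf{X}_{\text{train}})\), since it does not depend on \(\theta\).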
Note
Mathematical Notation
The math notation of my content, including in this post, follows the conventions in Christopher M. Bishop’s Pattern Recognition and Machine Learning. In addition, I use calligraphic capitalized Roman and capitalized Greek symbols like \(\mathcal{X}, \mathcal{Y}, \mathcal{Z}, \Omega, \Psi, \Xi, \ldots\) to represent BOTH a set of values that the random variables can take as well as the argument of a function in Python (e.g. `def p(Θ=θ)`).
https://pyro.ai/examples/intro_long.html#Background:-inference,-learning-and-evaluation
Objective:
Derivation 2:¶
Parameter Uncertainty¶
Frequentist: Uncertainty is estimated with confidence intervals, intervals constructed so that, over repeated sampling, a specified fraction (e.g. 95%) of them would contain the fixed true parameter

Bayesian: Uncertainty is estimated with credible intervals, intervals that contain the parameter with a specified posterior probability
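To make the distinction concrete, here is a minimal sketch (my own example, not from the post; the Gaussian model, known noise scale, and the \(N(0, 10^2)\) prior are assumptions chosen for conjugacy):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sigma = 2.0                                   # noise scale, assumed known for simplicity
x = rng.normal(loc=5.0, scale=sigma, size=50)  # observed data
n = len(x)

# Frequentist 95% confidence interval for the mean (known-sigma z-interval):
# a statement about the procedure over repeated samples
z = stats.norm.ppf(0.975)
ci = (x.mean() - z * sigma / np.sqrt(n), x.mean() + z * sigma / np.sqrt(n))

# Bayesian 95% credible interval with a conjugate N(0, 10^2) prior on the mean:
# a statement about the posterior probability of the parameter
prior_mu, prior_sd = 0.0, 10.0
post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
post_mu = post_var * (prior_mu / prior_sd**2 + x.sum() / sigma**2)
cred = (stats.norm.ppf(0.025, loc=post_mu, scale=np.sqrt(post_var)),
        stats.norm.ppf(0.975, loc=post_mu, scale=np.sqrt(post_var)))
```

With a broad prior and this much data the two intervals are numerically similar, but their interpretations differ as described above.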