Frequentist Vs. Bayesian Methods

By: Chengyi (Jeff) Chen


Introduction


Maximum Likelihood Estimation

Derivation 1: KL Divergence

How to find the best p(X,Z;Θ=θ)?

To learn p(X,Z;Θ=θ), we first need a measure of success: how useful our model is, i.e. how accurately it models the true data distribution. Because we can only observe X, let's define a "distance" measure between our incomplete data likelihood p(X;Θ=θ) (rather than the complete data likelihood, which we can't compute because we can't observe Z) and the true data distribution f(X). The smaller the "distance" between the two distributions, the better our model approximates the true data-generating process. A common "distance" measure between probability distributions is the KL Divergence ("distance" in quotes because the KL Divergence is asymmetric, $D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P)$, and does not satisfy the triangle inequality). $D_{KL}(f(X) \,\|\, p(X;\Theta{=}\theta))$ measures how well p approximates f:

$$
\begin{aligned}
\theta^{*}
&= \arg\min_{\theta \in \Theta} D_{KL}(f \,\|\, p) \\
&= \arg\min_{\theta \in \Theta} \int_{x \in \mathcal{X}} f(X{=}x) \log \frac{f(X{=}x)}{p(X{=}x; \Theta{=}\theta)} \, dx \\
&= \arg\min_{\theta \in \Theta} \mathbb{E}_{x \sim f}\left[\log f(X{=}x)\right] - \mathbb{E}_{x \sim f}\left[\log p(X{=}x; \Theta{=}\theta)\right] \\
&= \arg\min_{\theta \in \Theta} -\mathbb{H}\left[f(X)\right] - \mathbb{E}_{x \sim f}\left[\log p(X{=}x; \Theta{=}\theta)\right] \\
&= \arg\max_{\theta \in \Theta} \mathbb{E}_{x \sim f}\left[\log p(X{=}x; \Theta{=}\theta)\right] \\
&\approx \arg\max_{\theta \in \Theta} \lim_{N \to \infty} \frac{1}{N} \sum_{x_i \in X_{train}} \log p(X{=}x_i; \Theta{=}\theta) && \text{law of large numbers} \\
&= \arg\max_{\theta \in \Theta} \prod_{x_i \in X_{train}} p(X{=}x_i; \Theta{=}\theta) && \text{log is a monotonic increasing function} \\
&= \arg\max_{\theta \in \Theta} p(X{=}X_{train}; \Theta{=}\theta) && \text{i.i.d. data assumption} \\
&= \theta_{MLE}
\end{aligned}
\tag{17}
$$

We have thus arrived at Maximum Likelihood Estimation of parameters (you can read more about this derivation method here and here): a point estimate of the parameters that maximizes the incomplete data likelihood (or the complete data likelihood when the model has no latent variables).
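As a minimal sketch of this idea (the Gaussian model, parameter values, and function names below are illustrative, not from the original post), we can check that the closed-form Gaussian MLE does maximize the log-likelihood over the training data:

```python
import numpy as np

def gaussian_log_likelihood(theta, x):
    """Log-likelihood of i.i.d. data x under a Gaussian with theta = (mu, sigma)."""
    mu, sigma = theta
    n = len(x)
    return -n / 2 * np.log(2 * np.pi * sigma**2) - np.sum((x - mu) ** 2) / (2 * sigma**2)

rng = np.random.default_rng(0)
x_train = rng.normal(loc=3.0, scale=1.5, size=10_000)

# For the Gaussian, the MLE has a closed form: the sample mean and the
# (biased) sample standard deviation.
mu_mle = x_train.mean()
sigma_mle = x_train.std()

# Any other parameter setting attains a lower (or equal) log-likelihood.
assert gaussian_log_likelihood((mu_mle, sigma_mle), x_train) >= \
       gaussian_log_likelihood((2.5, 1.5), x_train)
```

With 10,000 samples, the law-of-large-numbers step in (17) is visible: the estimates land close to the true generating parameters (3.0, 1.5).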

Derivation 2: Posterior with Uniform Prior on Parameters

Why is MLE a “frequentist” inference technique?

The primary reason this technique is called a "frequentist" method is its assumption that Θ=θ is a fixed parameter that needs to be estimated, while Bayesians believe that Θ should be a random variable and hence have a probability distribution p(Θ) that describes its behavior, which we call our prior. In probabilistic programming / machine learning, however, we don't have to worry about these conflicting paradigms. To "convert" Θ into a random variable, we just need to move Θ into Z; as long as we have a way to model Z, more specifically the posterior distribution of our latent variables p(Z|X;Θ=θ), we are good.

Can we simply find the θ that maximizes p(X=Xtrain;Θ=θ)?

Unfortunately, because our model is specified with the latent variables Z, we can’t directly maximize p(X=Xtrain;Θ=θ). We’ll have to marginalize out the latent variables first as follows:

$$
\begin{aligned}
p(X{=}X_{train}; \Theta{=}\theta)
&= \int_{z \in \mathcal{Z}} p(X{=}X_{train}, Z{=}z; \Theta{=}\theta) \, dz \\
&= \int_{z \in \mathcal{Z}} p(X{=}X_{train} \mid Z{=}z; \Theta{=}\theta) \, p(Z{=}z; \Theta{=}\theta) \, dz
\end{aligned}
\tag{18}
$$

and hence, Maximum Likelihood Estimation becomes:

$$
\theta^{*} = \arg\max_{\theta \in \Theta} \int_{z \in \mathcal{Z}} p(X{=}X_{train} \mid Z{=}z; \Theta{=}\theta) \, p(Z{=}z; \Theta{=}\theta) \, dz
\tag{19}
$$

However, this marginalization is often intractable (e.g. if Z is a sequence of events, the number of possible values grows exponentially with the sequence length, making exact calculation of the integral extremely difficult). Let's instead try to find a lower bound for it by expanding it.
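To make the marginalization in (18) concrete, here is a small sketch (the two-component Gaussian mixture and all numbers are illustrative assumptions, not from the post): with a single discrete latent the sum has only two terms, but the comment notes how the cost explodes for latent sequences.

```python
import numpy as np
from scipy.stats import norm

# Two-component Gaussian mixture: Z ∈ {0, 1} is the latent variable.
weights = np.array([0.3, 0.7])                 # p(Z=z; θ)
means = np.array([-2.0, 2.0])                  # component means
scales = np.array([1.0, 1.0])                  # component scales

def incomplete_likelihood(x):
    # p(x; θ) = Σ_z p(x | z; θ) p(z; θ): tractable here (2 terms),
    # but for a sequence of T binary latents the sum would have 2**T terms.
    return np.sum(weights * norm.pdf(x, means, scales))

print(incomplete_likelihood(0.0))
```

For continuous or combinatorially large Z this exact sum/integral is unavailable, which is what motivates the lower-bound (variational) approach the text turns to next.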


Maximum A Posteriori

Derivation 1: Computationally Inconvenient to calculate the full Posterior p(Θ|X=Xtrain)

Before continuing, realize that because the product rule factorizes the complete data likelihood as

$$
p(X, Z; \Theta{=}\theta) = p(Z \mid X; \Theta{=}\theta) \, p(X; \Theta{=}\theta)
\tag{20}
$$

the posterior distribution of the latent variables is

$$
p(Z \mid X; \Theta{=}\theta) = \frac{p(X, Z; \Theta{=}\theta)}{p(X; \Theta{=}\theta)}
\tag{21}
$$

and the maximum likelihood objective (19) can be written in terms of the marginalized joint:

$$
\theta^{*} = \arg\max_{\theta \in \Theta} \frac{1}{N} \sum_{x \in X_{train}} \log \int_{z \in \mathcal{Z}} p(X{=}x, Z{=}z; \Theta{=}\theta) \, dz
\tag{22}
$$
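The product-rule factorization above can be checked numerically for a discrete latent variable; this sketch (the mixture and observation value are illustrative assumptions) computes the posterior over Z by dividing the joint by the marginal:

```python
import numpy as np
from scipy.stats import norm

# Discrete latent Z ∈ {0, 1} with joint p(x, z; θ) = p(x | z; θ) p(z; θ).
prior = np.array([0.3, 0.7])                   # p(Z=z; θ)
likelihood = norm.pdf(1.0, [-2.0, 2.0], 1.0)   # p(X=1.0 | Z=z; θ)

joint = likelihood * prior                     # p(X=1.0, Z=z; θ)
evidence = joint.sum()                         # p(X=1.0; θ), marginalizing out Z
posterior = joint / evidence                   # p(Z=z | X=1.0; θ)

assert np.isclose(posterior.sum(), 1.0)        # a valid distribution over z
```

The observation x = 1.0 sits much closer to the second component, so nearly all posterior mass lands on Z = 1; the difficulty in general is that the `evidence` denominator is exactly the intractable marginal from (18).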

Note

Mathematical Notation

The math notation of my content, including this post, follows the conventions in Christopher M. Bishop's Pattern Recognition and Machine Learning. In addition, I use calligraphic capitalized Roman and capitalized Greek symbols like X, Y, Z, Ω, Ψ, Ξ to represent BOTH the set of values that a random variable can take and the argument of a function in python (e.g. def p(Θ=θ)).

https://pyro.ai/examples/intro_long.html#Background:-inference,-learning-and-evaluation

Objective:

$$
\theta_{MAP} = \arg\max_{\theta \in \Theta} p(\Theta{=}\theta \mid X{=}X_{train}) = \arg\max_{\theta \in \Theta} p(X{=}X_{train} \mid \Theta{=}\theta) \, p(\Theta{=}\theta)
\tag{23}
$$
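The MAP objective augments the log-likelihood with a log-prior on Θ before maximizing. A minimal sketch with a Beta-Bernoulli coin model (the prior hyperparameters and data below are illustrative assumptions, not from the post):

```python
import numpy as np

# Beta(a, b) prior on the bias θ of a coin; x are Bernoulli observations.
a, b = 2.0, 2.0
x = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])   # 8 heads, 2 tails

theta_mle = x.mean()                            # argmax of the likelihood alone
# MAP maximizes log p(x | θ) + log p(θ); for the Beta-Bernoulli model the
# posterior mode has the closed form (heads + a - 1) / (N + a + b - 2).
theta_map = (x.sum() + a - 1) / (len(x) + a + b - 2)

# The prior pulls the point estimate toward its mean of 0.5.
assert abs(theta_map - 0.5) < abs(theta_mle - 0.5)
```

Here MLE gives 0.8 while MAP gives 0.75: still a single point estimate, which is why MAP is a computationally convenient stand-in for the full posterior p(Θ|X=Xtrain) rather than a fully Bayesian answer.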

Derivation 2:

Parameter Uncertainty

Frequentist: Uncertainty is estimated with confidence intervals

Bayesian: Uncertainty is estimated with credible intervals
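The two interval types can be contrasted on a binomial proportion; this sketch (sample size, confidence level, and the uniform prior are illustrative assumptions) uses the normal-approximation confidence interval on the frequentist side and Beta posterior quantiles on the Bayesian side:

```python
import numpy as np
from scipy import stats

heads, n = 8, 10

# Frequentist: normal-approximation 95% confidence interval for the proportion.
p_hat = heads / n
se = np.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# Bayesian: 95% credible interval from the Beta(1 + heads, 1 + tails)
# posterior under a uniform Beta(1, 1) prior.
posterior = stats.beta(1 + heads, 1 + n - heads)
cred = (posterior.ppf(0.025), posterior.ppf(0.975))
```

With this small sample the normal-approximation interval actually spills past 1.0, a known weakness of that approximation, while the credible interval necessarily stays inside [0, 1]; the deeper difference is interpretive: the confidence interval is a statement about the procedure under repeated sampling, the credible interval a probability statement about Θ itself.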

Prediction Intervals


Empirical Bayes; Type II Maximum Likelihood Estimation


Hierarchical Bayes