Probabilistic Machine Learning

By: Chengyi (Jeff) Chen


Introduction

The purpose of this set of notes is to connect ideas across the frequentist, Bayesian, and probabilistic machine learning vernacular, e.g. how 1. frequentist maximum likelihood estimation relates to 2. partially Bayesian maximum a posteriori estimation and 3. fully Bayesian inference. I’m in no way an expert on the philosophical and practical differences between the frequentist and Bayesian perspectives, nor am I close to being good at mathematics – this is just what I’ve gathered from my readings, subject to my own interpretation. Throughout, I’ll also be drawing ideas from computer programming, specifically notes on Uber’s Pyro PPL. Starting from first principles, we ask: “What are we even trying to do in machine learning?”


Setup

Before we distinguish between supervised, unsupervised, and semi-supervised learning, here’s the general probabilistic machine learning setting:

We are given a matrix of observed training data \(\mathbf{X}_{\text{train}} = \{ \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \ldots, \mathbf{x}_N \}\), consisting of independent samples generated from a true data distribution \(f(\mathcal{X})\), where \(\mathbf{x} \in \mathcal{X}\) (the set of observable data values).
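For instance (this snippet is my own illustration and not part of the original notes), if we pretend the unknown \(f(\mathcal{X})\) were a 2-component Gaussian mixture, generating such a training set in Python / NumPy would look like this:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" data distribution f: a 2-component Gaussian mixture.
# In practice f is unknown -- we only ever see the samples it generates.
true_weights = np.array([0.3, 0.7])
true_means = np.array([-2.0, 3.0])
true_scales = np.array([0.5, 1.0])

N = 500
components = rng.choice(2, size=N, p=true_weights)                      # hidden source of each sample
X_train = rng.normal(true_means[components], true_scales[components])  # observed i.i.d. samples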


Model

We specify a probabilistic model of the form:

(13)\[\begin{align} p(\mathcal{X}, \mathcal{Z} ; \Theta = \theta) \\ \end{align}\]

to approximate \(f(\mathcal{X})\) by learning from \(\mathbf{X}_{\text{train}}\), where \(\mathbf{z} \in \mathcal{Z}\) denotes the latent / unobserved random variables; we introduce these because we make no assumption that the observable dataset \(\mathbf{X}_{\text{train}}\) contains all the information about the system. This joint probability over both the observed data and the latent random variables, as a function of the parameters \(\Theta\), is often called the complete data likelihood, and it is usually factorized into conditional dependencies by representing the joint probability as a directed graphical model, e.g. a Gaussian Mixture Model:

https://upload.wikimedia.org/wikipedia/commons/thumb/2/28/Bayesian-gaussian-mixture.svg/600px-Bayesian-gaussian-mixture.svg.png

Fig. 3 Gaussian Mixture Model Directed Graphical Model

(14)\[\begin{align} p(\mathcal{X} ; \Theta = \theta) &= \int_{\mathbf{z} \in \mathcal{Z}} p(\mathcal{X}, \mathcal{Z} = \mathbf{z}; \Theta = \theta) d\mathbf{z} \\ \end{align}\]

is then called the incomplete data likelihood / evidence / marginal likelihood (because we marginalize out \(\mathcal{Z}\) to keep only \(\mathcal{X}\)). Firstly, note that \(\Theta = \theta\) are fixed parameters (“\(;\)” is used instead of “\(\vert\)” when conditioning on \(\theta\) to indicate that it is a “frequentist” fixed parameter and not a “Bayesian” random variable). Secondly, note that we have to marginalize out the latent variables and work only with the joint probability model because that is the only model we have access to; we have no direct access to the marginal probability \(p(\mathcal{X} ; \Theta = \theta)\).
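To make this concrete, here is a minimal Pyro sketch (my own illustration, not from the original notes) of the Gaussian Mixture Model above: the mixture weights, component means, and scale play the role of the fixed parameters \(\theta\), the per-datum component assignments are the latent \(\mathbf{z}\), and the model function encodes the complete data likelihood \(p(\mathcal{X}, \mathcal{Z}; \Theta = \theta)\) of Eq. (13). The second function sums out the discrete latent to give the incomplete data likelihood of Eq. (14).

import torch
import pyro
import pyro.distributions as dist

# Fixed ("frequentist") parameters theta -- plain tensors, not random variables.
K = 2
weights = torch.tensor([0.3, 0.7])   # mixing proportions pi_k
locs = torch.tensor([-2.0, 3.0])     # component means mu_k
scale = torch.tensor(1.0)            # shared component std dev sigma

def gmm_model(data):
    # Complete data likelihood p(X, Z; theta), written as a directed graphical model.
    with pyro.plate("data", len(data)):
        z = pyro.sample("z", dist.Categorical(weights))            # latent assignment z_n
        pyro.sample("obs", dist.Normal(locs[z], scale), obs=data)  # observed x_n given z_n

def incomplete_data_log_likelihood(data):
    # Incomplete data likelihood p(X; theta): the integral over z is a sum over the K components.
    log_px_given_z = dist.Normal(locs, scale).log_prob(data.unsqueeze(-1))  # shape (N, K)
    return torch.logsumexp(log_px_given_z + weights.log(), dim=-1).sum()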


Objectives

Learning such a probabilistic model has 2 primary objectives:

Objective 1

Draw conclusions about the posterior distribution of our latent variables \(\mathcal{Z}\):

(15)\[\begin{align} p(\mathcal{Z} \vert \mathcal{X} = \mathbf{X}_{\text{train}}; \Theta = \theta) &= \frac{p(\mathcal{X} = \mathbf{X}_{\text{train}}, \mathcal{Z}; \Theta = \theta)}{\int_{\mathbf{z} \in \mathcal{Z}} p(\mathcal{X} = \mathbf{X}_{\text{train}}, \mathcal{Z} = \mathbf{z}; \Theta = \theta) d\mathbf{z}} \\ \end{align}\]
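Continuing the toy GMM sketch from the Model section (again my own illustration, reusing the names defined there): because the latent \(\mathbf{z}\) is discrete in that model, the integral in the denominator reduces to a sum over the \(K\) components, so this posterior can be computed exactly with Bayes’ rule.

def posterior_z(data):
    # p(z_n = k | x_n; theta): joint over (x_n, z_n = k), normalized over k.
    log_joint = dist.Normal(locs, scale).log_prob(data.unsqueeze(-1)) + weights.log()  # (N, K)
    return torch.softmax(log_joint, dim=-1)   # each row sums to 1 over the K components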

Objective 2

Make predictions for new data, which we can do with the posterior predictive distribution:

(16)\[\begin{align} p(\mathcal{X} = \mathbf{X}_{\text{test}} \vert \mathcal{X} = \mathbf{X}_{\text{train}}; \Theta = \theta) &= \int_{\mathbf{z} \in \mathcal{Z}} p(\mathcal{X} = \mathbf{X}_{\text{test}} \vert \mathcal{Z} = \mathbf{z}; \Theta = \theta) p(\mathcal{Z} = \mathbf{z} \vert \mathcal{X} = \mathbf{X}_{\text{train}}; \Theta = \theta) d\mathbf{z} \\ \end{align}\]
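The notes stop at the formula, so as a hedged illustration of how this integral is usually handled, here is a small Monte Carlo sketch for a different toy model of my own (a single global latent mean \(\mathbf{z}\) with a conjugate Normal prior, and fixed scales as \(\theta\)): the posterior is available in closed form, and the posterior predictive density is approximated by averaging \(p(\mathbf{X}_{\text{test}} \vert \mathbf{z}^{(s)}; \theta)\) over posterior samples \(\mathbf{z}^{(s)}\).

import torch
import pyro.distributions as dist

# Toy conjugate model (my own example): z ~ Normal(mu0, tau0), x_n | z ~ Normal(z, sigma).
mu0, tau0, sigma = 0.0, 5.0, 1.0                 # fixed parameters theta
x_train = torch.randn(100) + 2.0                 # stand-in for X_train

# Closed-form Gaussian posterior p(z | X_train; theta) via Normal-Normal conjugacy.
n = x_train.numel()
post_var = 1.0 / (1.0 / tau0**2 + n / sigma**2)
post_mean = post_var * (mu0 / tau0**2 + x_train.sum() / sigma**2)
posterior = dist.Normal(post_mean, post_var**0.5)

# Monte Carlo estimate of the posterior predictive density at a new point x_test.
x_test = torch.tensor(2.5)
z_samples = posterior.sample((5000,))                                     # z^(s) ~ p(z | X_train; theta)
predictive_density = dist.Normal(z_samples, sigma).log_prob(x_test).exp().mean()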