Uplift Modelling and Contextual Bandits [In Progress]

By: Chengyi (Jeff) Chen

%load_ext autotime
%load_ext nb_black

import numpy as np
import scipy as sp
import scipy.special
import scipy.stats

import causalml
import obp

Introduction

Having worked with uplift modelling at Shopee and contextual bandits at Gojek, I’ve come to realize how similar these two seemingly disparate problems are once they’re written down formally. In this post, I’ll do my best to reconcile the two modelling approaches and show how uplift modelling is simply a subclass of contextual bandits. I’ll walk through uplift modelling with the help of Uber’s Causal ML package and contextual bandits with the Open Bandit Pipeline. In particular, I’m interested in comparing the different algorithms used for off-policy evaluation.


Preliminaries

What is uplift modelling?

Uplift modelling estimates the incremental effect of a treatment (e.g. showing a user a promotion) on an individual’s outcome, i.e. the conditional average treatment effect \(\tau(x) = \mathbb{E}[Y \mid X = x, T = 1] - \mathbb{E}[Y \mid X = x, T = 0]\), so that we can target only the users whose behaviour the treatment actually changes. A quick code sketch follows below.
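For a first taste of the tooling, here’s a minimal S-learner sketch, assuming Causal ML’s synthetic_data and LRSRegressor APIs as documented in its README; the data are synthetic and purely illustrative.

from causalml.dataset import synthetic_data
from causalml.inference.meta import LRSRegressor

# Hypothetical synthetic data: y = outcomes, X = features, treatment = assignments,
# tau = true individual treatment effects (mode=1 is one of Causal ML's built-in DGPs)
y, X, treatment, tau, b, e = synthetic_data(mode=1, n=10_000, p=5, sigma=1.0)

# S-learner with a linear regression base model: estimate the Average Treatment Effect
learner = LRSRegressor()
ate, ate_lb, ate_ub = learner.estimate_ate(X=X, treatment=treatment, y=y)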

What are contextual bandits?

In a contextual bandit, at each round the learner observes a context \(x\), chooses an action \(a\) according to a policy \(\pi(a \vert x)\), and observes the reward of the chosen action only (bandit feedback). It sits in the middle of a spectrum of sequential decision-making problems:

1. Multi-Armed Bandit

2. Contextual Bandit

3. Full Reinforcement Learning

https://stats.stackexchange.com/questions/89396/multi-armed-bandit-algorithms-vs-uplift-modeling


15.2. Off-Policy Confidence Intervals

15.2.1 T-distribution (Population Variance Unknown)

def student_t_bound(**kwargs):
    """Calculates the Student's t confidence interval for the V_ips estimator
    If V_ips = 1/nΣ_nX = μ, then var[V_ips] = sample variance / n = (1/(n - 1)Σ_n(X - μ)^2) / n

    Args:
        V_ips (float): Estimated value of the policy π using the Inverse Propensity Score estimator
        var_V_ips (float): Variance of V_ips
        n (int): Size of the logging policy dataset
        δ (float): Significance level for confidence coverage. If δ = 0.05, it's a 95% confidence interval

    Returns:
        Tuple: (lower bound of V(π), upper bound of V(π))

    """
    # t critical value with n - 1 degrees of freedom (population variance unknown)
    t_score = sp.stats.t.ppf(1 - kwargs["δ"] / 2, df=kwargs["n"] - 1)

    # Clip the interval to [0, 1] since 0 <= V(π) <= 1
    lb = min(1, max(0, kwargs["V_ips"] - t_score * (kwargs["var_V_ips"] ** 0.5)))
    ub = min(1, max(0, kwargs["V_ips"] + t_score * (kwargs["var_V_ips"] ** 0.5)))

    return min(lb, ub), max(lb, ub)
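As a quick sanity check, here’s a hypothetical call with made-up numbers: an IPS estimate of 0.12 with estimator variance 0.0004 over 10,000 logged samples.

# Hypothetical inputs, purely for illustration
student_t_bound(V_ips=0.12, var_V_ips=0.0004, n=10_000, δ=0.05)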

15.2.2 Asymptotically Gaussian (Central Limit Theorem)

def asymptotic_gaussian_bound(**kwargs):
    """Calculates the asymptotically Gaussian confidence interval for the V_ips estimator
    As n --> infinity, by the CLT, the sampling distribution of the sample mean of the
    importance-weighted rewards α_i approaches a Gaussian.
    If V_ips = 1/nΣ_nX = μ, then var[V_ips] = sample variance / n = (1/(n - 1)Σ_n(X - μ)^2) / n

    Args:
        V_ips (float): Estimated value of the policy π using the Inverse Propensity Score estimator
        var_V_ips (float): Variance of V_ips
        δ (float): Significance level for confidence coverage. If δ = 0.05, it's a 95% confidence interval

    Returns:
        Tuple: (lower bound of V(π), upper bound of V(π))

    """
    z_score = sp.stats.norm.ppf(1 - kwargs["δ"] / 2)

    # Clip the interval to [0, 1] since 0 <= V(π) <= 1
    lb = min(1, max(0, kwargs["V_ips"] - z_score * (kwargs["var_V_ips"] ** 0.5)))
    ub = min(1, max(0, kwargs["V_ips"] + z_score * (kwargs["var_V_ips"] ** 0.5)))

    return min(lb, ub), max(lb, ub)

15.2.3 Clopper-Pearson

def clopper_pearson_bound(**kwargs):
    """Calculates the Clopper-Pearson bound for the V_ips estimator
    Treats the rescaled importance-weighted rewards α_i / w_max ∈ [0, 1] as Bernoulli draws,
    so that k = V_ips * n / w_max plays the role of the number of successes out of n trials.
    0 <= V(π) <= 1
    https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper%E2%80%93Pearson_interval

    Args:
        V_ips (float): Estimated value of the policy π using the Inverse Propensity Score estimator
        w_max (float): Largest importance weight, max_i π(a_i|x_i) / μ(a_i|x_i)
        n (int): Size of the logging policy dataset
        δ (float): Significance level for confidence coverage. If δ = 0.05, it's a 95% confidence interval

    Returns:
        Tuple: (lower Clopper-Pearson bound of V(π), upper Clopper-Pearson bound of V(π))

    """
    # Equivalent number of "successes" after rescaling the rewards by w_max
    k = kwargs["V_ips"] * kwargs["n"] / kwargs["w_max"]

    # Beta-quantile form of the Clopper-Pearson interval
    lb = sp.special.betaincinv(k, kwargs["n"] - k + 1, kwargs["δ"] / 2) if k > 0 and kwargs["n"] - k + 1 > 0 else 0
    ub = sp.special.betaincinv(k + 1, kwargs["n"] - k, 1 - (kwargs["δ"] / 2)) if kwargs["n"] > k else 1

    # Undo the rescaling and clip to [0, 1] since 0 <= V(π) <= 1
    lb = min(1, max(0, kwargs["w_max"] * lb))
    ub = min(1, max(0, kwargs["w_max"] * ub))

    return min(lb, ub), max(lb, ub)

15.2.4 Bootstrapping
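Until I flesh this section out, here’s a minimal percentile-bootstrap sketch in the same style as the other bounds. Unlike the closed-form bounds, it needs the per-sample importance-weighted rewards \(\alpha_i\) (passed as a weighted_rewards array) rather than just summary statistics.

def bootstrap_bound(**kwargs):
    """Calculates a percentile-bootstrap confidence interval for the V_ips estimator

    Args:
        weighted_rewards (np.ndarray): Per-sample importance-weighted rewards α_i = (π(a_i|x_i) / μ(a_i|x_i)) * r_i
        n_bootstrap (int): Number of bootstrap resamples
        δ (float): Significance level for confidence coverage. If δ = 0.05, it's a 95% confidence interval

    Returns:
        Tuple: (lower bootstrap bound of V(π), upper bootstrap bound of V(π))

    """
    rng = np.random.default_rng(kwargs.get("seed"))
    α = kwargs["weighted_rewards"]

    # Resample α with replacement and recompute the IPS estimate on each resample
    boot_estimates = np.array(
        [rng.choice(α, size=len(α), replace=True).mean() for _ in range(kwargs["n_bootstrap"])]
    )

    # Percentile interval: the δ/2 and 1 - δ/2 empirical quantiles, clipped to [0, 1]
    lb = min(1, max(0, np.quantile(boot_estimates, kwargs["δ"] / 2)))
    ub = min(1, max(0, np.quantile(boot_estimates, 1 - kwargs["δ"] / 2)))

    return min(lb, ub), max(lb, ub)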

15.2.5 Concentration Inequality: Hoeffding Inequality

(2)\[\begin{align} { \left\vert \hat{V}_{\text{IPS}}{(\pi)} - {V}(\pi) \right\vert } &\leq \sqrt{\frac{\sum_{i=1}^{n}{(b_i - a_i)}^2}{2n^2} \ln{\frac{2}{\delta}}} \\ \end{align}\]

Hoeffding’s Inequality: Let \(X_1, ..., X_n\) be independent random variables strictly bounded by the interval \([a_i, b_i]\), \(a_i \leq X_i \leq b_i\). We define the empirical mean of these variables by \(\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i\), such that

(3)\[\begin{align} P(\left\vert \bar{X} - \mathbb{E}[\bar{X}] \right\vert \geq \epsilon) \leq 2{e}^{\Big(-\frac{2n^2\epsilon^2}{\sum_{i=1}^{n}{(b_i - a_i)}^2}\Big)} \\ \end{align}\]

From Hoeffding’s Inequality:

(4)\[\begin{align} P(\left\vert \bar{X} - \mathbb{E}[\bar{X}] \right\vert \geq \epsilon) &\leq 2{e}^{\Big(-\frac{2n^2\epsilon^2}{\sum_{i=1}^{n}{(b_i - a_i)}^2}\Big)} \\ 1 - P(\left\vert \bar{X} - \mathbb{E}[\bar{X}] \right\vert < \epsilon) &\leq 2{e}^{\Big(-\frac{2n^2\epsilon^2}{\sum_{i=1}^{n}{(b_i - a_i)}^2}\Big)} \\ \end{align}\]
(5)\[\begin{equation} \label{eq:1} P(\left\vert \bar{X} - \mathbb{E}[\bar{X}] \right\vert < \epsilon) \geq \underbrace{1 - 2{e}^{\Big(-\frac{2n^2\epsilon^2}{\sum_{i=1}^{n}{(b_i - a_i)}^2}\Big)}}_{1-\delta} \end{equation}\]

Finding \(\epsilon\):

(6)\[\begin{align} \therefore 1-\delta &= 1 - 2{e}^{\Big(-\frac{2n^2\epsilon^2}{\sum_{i=1}^{n}{(b_i - a_i)}^2}\Big)} \\ \delta &= 2{e}^{\Big(-\frac{2n^2\epsilon^2}{\sum_{i=1}^{n}{(b_i - a_i)}^2}\Big)} \\ \ln{\delta} &= \ln{2} - \frac{2n^2\epsilon^2}{\sum_{i=1}^{n}{(b_i - a_i)}^2} \\ \ln{\frac{2}{\delta}} &= \frac{2n^2\epsilon^2}{\sum_{i=1}^{n}{(b_i - a_i)}^2} \\ 2n^2\epsilon^2 &= \sum_{i=1}^{n}{(b_i - a_i)}^2 \ln{\frac{2}{\delta}} \\ \epsilon &= \sqrt{\frac{\sum_{i=1}^{n}{(b_i - a_i)}^2}{2n^2} \ln{\frac{2}{\delta}}} \\ \end{align}\]

Substituting \(\epsilon\) into the coverage statement in Equation (5) above, we get our Hoeffding Bound Confidence Intervals:

(7)\[\begin{align} P\Bigg(\left\vert \bar{X} - \mathbb{E}[\bar{X}] \right\vert < \sqrt{\frac{\sum_{i=1}^{n}{(b_i - a_i)}^2}{2n^2} \ln{\frac{2}{\delta}}}\Bigg) &\geq 1-\delta \\ P\Bigg(-\sqrt{\frac{\sum_{i=1}^{n}{(b_i - a_i)}^2}{2n^2} \ln{\frac{2}{\delta}}}< \bar{X} - \mathbb{E}[\bar{X}] < \sqrt{\frac{\sum_{i=1}^{n}{(b_i - a_i)}^2}{2n^2} \ln{\frac{2}{\delta}}}\Bigg) &\geq 1-\delta \\ P\Bigg(\bar{X}-\sqrt{\frac{\sum_{i=1}^{n}{(b_i - a_i)}^2}{2n^2} \ln{\frac{2}{\delta}}}< \mathbb{E}[\bar{X}] < \bar{X}+\sqrt{\frac{\sum_{i=1}^{n}{(b_i - a_i)}^2}{2n^2} \ln{\frac{2}{\delta}}}\Bigg) &\geq 1-\delta \end{align}\]

Assuming that the importance-weighted rewards \(\alpha_i\) are bounded between \([0, w_{max}]\), where \(w_{max}=\text{max}_{i\in n}\frac{\pi(a_i\vert x_i)}{\mu(a_i\vert x_i)}\) and rewards \(r_i \in [0, 1]\), we can substitute the following:

  • \(\mathbb{E}[\bar{X}]: V(\pi)\)

  • \(X_i: \alpha_i = \frac{\pi(a_i\vert x_i)}{\mu(a_i\vert x_i)}r_i\)

  • \(\bar{X}: \hat{V}_{\text{IPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n}\frac{\pi(a_i\vert x_i)}{\mu(a_i\vert x_i)}r_i\)

  • \(b_i: w_{max}\)

  • \(a_i: 0\)

(8)\[\begin{align} P\Bigg(\hat{V}_{\text{IPS}}(\pi)-\sqrt{\frac{\sum_{i=1}^{n}{(w_{max} - 0)}^2}{2n^2} \ln{\frac{2}{\delta}}} < V(\pi) < \hat{V}_{\text{IPS}}(\pi)+\sqrt{\frac{\sum_{i=1}^{n}{(w_{max} - 0)}^2}{2n^2} \ln{\frac{2}{\delta}}}\Bigg) &\geq 1-\delta \\ P\Bigg(\hat{V}_{\text{IPS}}(\pi)-\sqrt{\frac{nw^2_{max}}{2n^2} \ln{\frac{2}{\delta}}} < V(\pi) < \hat{V}_{\text{IPS}}(\pi)+\sqrt{\frac{nw^2_{max}}{2n^2} \ln{\frac{2}{\delta}}}\Bigg) &\geq 1-\delta \\ P\Bigg(\hat{V}_{\text{IPS}}(\pi)-w_{max}\sqrt{\frac{1}{2n} \ln{\frac{2}{\delta}}} < V(\pi) < \hat{V}_{\text{IPS}}(\pi)+w_{max}\sqrt{\frac{1}{2n} \ln{\frac{2}{\delta}}}\Bigg) &\geq 1-\delta \end{align}\]
def hoeffding_bound(**kwargs):
    """Calculates the empirical Hoeffding bound for the V_ips estimator
    using the Hoeffding Inequality

    Args:
        V_ips (float): Estimated value of the policy π using the Inverse Propensity Score estimator
        w_max (float): Largest importance weight, max_i π(a_i|x_i) / μ(a_i|x_i)
        n (int): Size of the logging policy dataset
        δ (float): Significance level for confidence coverage. If δ = 0.05, it's a 95% confidence interval

    Returns:
        Tuple: (lower empirical Hoeffding bound of V(π), upper empirical Hoeffding bound of V(π))

    """
    # ε = w_max * sqrt(ln(2/δ) / (2n)), from the derivation above
    ε = kwargs["w_max"] * np.sqrt((1 / (2 * kwargs["n"])) * np.log(2 / kwargs["δ"]))

    # Clip the interval to [0, 1] since 0 <= V(π) <= 1
    lb = min(1, max(0, kwargs["V_ips"] - ε))
    ub = min(1, max(0, kwargs["V_ips"] + ε))
    return lb, ub

15.2.6 Concentration Inequality: Bernstein Inequality

(9)\[\begin{align} { \left\vert \hat{V}_{\text{IPS}}{(\pi)} - {V}(\pi) \right\vert } &\leq { \sqrt{ {2\text{log}\frac{2}{\delta}}{\frac{{\text{Var}}_{(x, a, r) \sim \mu}[\hat{V}_{\text{IPS}}{(\pi)}]}{n}}} + \frac{2{\hat{w}_{max}}}{3n}{\text{log} \frac{2}{\delta}} } \\ \end{align}\]

Bernstein Inequality: Suppose \(X_1, \cdots, X_n\) are \(i.i.d.\) with 0 mean, variance \(\sigma^2\) and \(\vert X_i \vert \leq M\) almost surely,

(10)\[\begin{align} P\Bigg({ \left\vert \frac{1}{n}\sum^{n}_{i=1} X_i \right\vert } &\leq { \sqrt{\frac{2\sigma^2}{n}\log\frac{2}{\delta}} + \frac{2M}{3n}{\log\frac{2}{\delta}} }\Bigg) \geq 1 - \delta \end{align}\]

Since \(\hat{V}_{\text{IPS}}(\pi)\) is an unbiased estimator of \(V(\pi)\), \(\hat{V}_{\text{IPS}}{(\pi)} - {V}(\pi)\) has mean 0 and variance \(\text{Var}[\hat{V}_{\text{IPS}}(\pi)]\); setting \(M = w_{max}\):

(11)\[\begin{align} P\Bigg({ \left\vert \hat{V}_{\text{IPS}}{(\pi)} - {V}(\pi) \right\vert } &\leq { \sqrt{ {2\text{log}\frac{2}{\delta}}{\frac{{\text{Var}}_{(x, a, r) \sim \mu}[\hat{V}_{\text{IPS}}{(\pi)}]}{n}}} + \frac{2{\hat{w}_{max}}}{3n}{\text{log} \frac{2}{\delta}} }\Bigg) \geq 1-\delta \end{align}\]
(12)\[\begin{align} P\Bigg(\hat{V}_{\text{IPS}}{(\pi)} - \Bigg({ \sqrt{ {2\log\frac{2}{\delta}}{\frac{{\text{Var}}_{(x, a, r) \sim \mu}[\hat{V}_{\text{IPS}}{(\pi)}]}{n}}} + \frac{2{\hat{w}_{max}}}{3n}{\log \frac{2}{\delta}} }\Bigg) \leq {V}(\pi) \leq \hat{V}_{\text{IPS}}{(\pi)} + \Bigg({ \sqrt{ {2\log\frac{2}{\delta}}{\frac{{\text{Var}}_{(x, a, r) \sim \mu}[\hat{V}_{\text{IPS}}{(\pi)}]}{n}}} + \frac{2{\hat{w}_{max}}}{3n}{\log \frac{2}{\delta}} }\Bigg)\Bigg) &\geq 1 - \delta \\ \end{align}\]
def bernstein_bound(**kwargs):
    """Calculates the empirical Bernstein bound for the V_ips estimator
    using the Bernstein Inequality

    Args:
        V_ips (float): Estimated value of the policy π using the Inverse Propensity Score estimator
        var_V_ips (float): Variance of V_ips
        w_max (float): Largest importance weight, max_i π(a_i|x_i) / μ(a_i|x_i)
        n (int): Size of the logging policy dataset
        δ (float): Significance level for confidence coverage. If δ = 0.05, it's a 95% confidence interval

    Returns:
        Tuple: (lower empirical Bernstein bound of V(π), upper empirical Bernstein bound of V(π))

    """
    # var_V_ips is already the variance of the estimator (per-sample variance / n),
    # so the first term is sqrt(2 * ln(2/δ) * var_V_ips) with no extra 1/n factor
    ε = np.sqrt(2 * np.log(2 / kwargs["δ"]) * kwargs["var_V_ips"]) + (
        2 * kwargs["w_max"] * np.log(2 / kwargs["δ"])
    ) / (3 * kwargs["n"])

    # Clip the interval to [0, 1] since 0 <= V(π) <= 1
    lb = min(1, max(0, kwargs["V_ips"] - ε))
    ub = min(1, max(0, kwargs["V_ips"] + ε))
    return lb, ub
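To compare the closed-form bounds side by side, here’s a hypothetical run over a shared set of made-up summary statistics (bootstrap_bound is omitted since it needs the per-sample rewards). At this sample size, the concentration-inequality bounds come out noticeably wider than the distributional ones.

# Hypothetical inputs, purely for illustration: an IPS estimate of 0.12 with
# estimator variance 0.0004, a maximum importance weight of 5, and n = 10,000
params = dict(V_ips=0.12, var_V_ips=0.0004, w_max=5.0, n=10_000, δ=0.05)

for bound in [
    student_t_bound,
    asymptotic_gaussian_bound,
    clopper_pearson_bound,
    hoeffding_bound,
    bernstein_bound,
]:
    print(f"{bound.__name__}: {bound(**params)}")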

15.3. Online Learning: Context-free

15.3.1 Random

15.3.2 Epsilon Greedy

15.3.3 Bernoulli Thompson Sampling
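Until I write these sections up properly, here’s a single simulation sketch covering all three context-free policies, assuming OBP’s obp.policy API (Random, EpsilonGreedy, and BernoulliTS, each exposing select_action / update_params); the per-arm reward probabilities below are made up.

from obp.policy import BernoulliTS, EpsilonGreedy, Random

n_actions = 3
context_free_policies = {
    "random": Random(n_actions=n_actions),
    "epsilon_greedy": EpsilonGreedy(n_actions=n_actions, epsilon=0.1),
    "bernoulli_ts": BernoulliTS(n_actions=n_actions),
}

rng = np.random.default_rng(0)
true_ctr = np.array([0.1, 0.2, 0.3])  # hypothetical per-arm reward probabilities

for name, policy in context_free_policies.items():
    for _ in range(1_000):
        # select_action returns the chosen action(s) as an array of length len_list
        action = policy.select_action()[0]
        reward = rng.binomial(n=1, p=true_ctr[action])
        policy.update_params(action=action, reward=reward)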


15.4. Online Learning: Contextual (Linear)

15.4.1 Linear Epsilon Greedy

15.4.2 Linear Thompson Sampling

15.4.3 Linear Upper Confidence Bound
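Again, a single sketch for now covering the three linear contextual policies, assuming OBP’s obp.policy API (LinEpsilonGreedy, LinTS, and LinUCB, each taking a context of shape (1, dim) in select_action and update_params); the reward model below is made up.

from obp.policy import LinEpsilonGreedy, LinTS, LinUCB

dim, n_actions = 5, 3
linear_policies = {
    "lin_epsilon_greedy": LinEpsilonGreedy(dim=dim, n_actions=n_actions, epsilon=0.1),
    "lin_ts": LinTS(dim=dim, n_actions=n_actions),
    "lin_ucb": LinUCB(dim=dim, n_actions=n_actions, epsilon=0.1),
}

rng = np.random.default_rng(0)
theta = rng.normal(size=(n_actions, dim))  # hypothetical per-arm reward weights

for name, policy in linear_policies.items():
    for _ in range(1_000):
        context = rng.normal(size=(1, dim))
        action = policy.select_action(context)[0]
        # Hypothetical reward model: Bernoulli(sigmoid(θ_a · x))
        p = 1 / (1 + np.exp(-context[0] @ theta[action]))
        policy.update_params(action=action, reward=rng.binomial(n=1, p=p), context=context)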


15.5. Online Learning: Contextual (Logistic)

15.5.1 Logistic Epsilon Greedy

15.5.2 Logistic Thompson Sampling

15.5.3 Logistic Upper Confidence Bound
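Same pattern as the linear case, assuming OBP’s logistic policy classes (LogisticEpsilonGreedy, LogisticTS, and LogisticUCB), which model the reward with an online logistic regression per arm; the interaction loop mirrors the linear sketch above.

from obp.policy import LogisticEpsilonGreedy, LogisticTS, LogisticUCB

dim, n_actions = 5, 3
logistic_policies = {
    "logistic_epsilon_greedy": LogisticEpsilonGreedy(dim=dim, n_actions=n_actions, epsilon=0.1),
    "logistic_ts": LogisticTS(dim=dim, n_actions=n_actions),
    "logistic_ucb": LogisticUCB(dim=dim, n_actions=n_actions, epsilon=0.1),
}

rng = np.random.default_rng(0)
theta = rng.normal(size=(n_actions, dim))  # hypothetical per-arm reward weights

for name, policy in logistic_policies.items():
    for _ in range(1_000):
        context = rng.normal(size=(1, dim))
        action = policy.select_action(context)[0]
        # Same hypothetical reward model as above: Bernoulli(sigmoid(θ_a · x))
        p = 1 / (1 + np.exp(-context[0] @ theta[action]))
        policy.update_params(action=action, reward=rng.binomial(n=1, p=p), context=context)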


15.6. Offline (Off-Policy) Learning Algorithms

15.6.1 Inverse Probability Weighting (IPW) Learner
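Here’s a minimal end-to-end sketch, assuming OBP’s SyntheticBanditDataset and IPWLearner APIs. IPWLearner reduces off-policy learning to importance-weighted classification: it fits a classifier from contexts to actions, weighting each logged sample by \(r_i / \mu(a_i \vert x_i)\).

from sklearn.linear_model import LogisticRegression

from obp.dataset import SyntheticBanditDataset
from obp.policy import IPWLearner

# Hypothetical synthetic logged bandit feedback (all parameters made up)
dataset = SyntheticBanditDataset(n_actions=3, dim_context=5, reward_type="binary", random_state=0)
bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=10_000)

# Fit the IPW learner: a classifier over actions, importance-weighted by r / pscore
ipw_learner = IPWLearner(n_actions=dataset.n_actions, base_classifier=LogisticRegression())
ipw_learner.fit(
    context=bandit_feedback["context"],
    action=bandit_feedback["action"],
    reward=bandit_feedback["reward"],
    pscore=bandit_feedback["pscore"],
)

# Action choices for new contexts, shape (n_rounds, n_actions, len_list)
action_dist = ipw_learner.predict(context=bandit_feedback["context"])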