Uplift Modelling and Contextual Bandits [In Progress]

By: Chengyi (Jeff) Chen

%load_ext autotime
%load_ext nb_black

import numpy as np
import scipy as sp
import scipy.special
import scipy.stats

import causalml
import obp

Introduction

Having worked with uplift modelling at Shopee and contextual bandits at Gojek, I’ve come to realize how similar these two seemingly disparate problems are once they’re written down formally. In this post, I’ll do my best to reconcile the two modelling approaches and show how uplift modelling is simply a subclass of contextual bandits. I’ll walk through uplift modelling with the help of Uber’s Causal ML package and contextual bandits with the Open Bandit Pipeline. In particular, I’m interested in comparing the different algorithms used for off-policy evaluation.


Preliminaries

What is uplift modelling?

Uplift modelling estimates the incremental effect of a treatment (e.g. showing a user a promotion) on an individual’s outcome, i.e. the conditional average treatment effect \(\tau(x) = \mathbb{E}[Y \mid X = x, T = 1] - \mathbb{E}[Y \mid X = x, T = 0]\), so that we can target only the users whose behaviour the treatment actually changes. A quick code sketch follows below.
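For a first taste of the tooling, here’s a minimal S-learner sketch, assuming Causal ML’s synthetic_data and LRSRegressor APIs as documented in its README; the data are synthetic and purely illustrative.

from causalml.dataset import synthetic_data
from causalml.inference.meta import LRSRegressor

# Hypothetical synthetic data: y = outcomes, X = features, treatment = assignments,
# tau = true individual treatment effects (mode=1 is one of Causal ML's built-in DGPs)
y, X, treatment, tau, b, e = synthetic_data(mode=1, n=10_000, p=5, sigma=1.0)

# S-learner with a linear regression base model: estimate the Average Treatment Effect
learner = LRSRegressor()
ate, ate_lb, ate_ub = learner.estimate_ate(X=X, treatment=treatment, y=y)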

What are contextual bandits?

In a contextual bandit, at each round the learner observes a context \(x\), chooses an action \(a\) according to a policy \(\pi(a \vert x)\), and observes the reward of the chosen action only (bandit feedback). It sits in the middle of a spectrum of sequential decision-making problems:

1. Multi-Armed Bandit

2. Contextual Bandit

3. Full Reinforcement Learning

https://stats.stackexchange.com/questions/89396/multi-armed-bandit-algorithms-vs-uplift-modeling


15.2. Off-Policy Confidence Intervals

15.2.1 T-distribution (Population Variance Unknown)

def student_t_bound(**kwargs):
    """Calculates the Student's t confidence interval for the V_ips estimator
    If V_ips = 1/nΣ_nX = μ, then var[V_ips] = sample variance / n = (1/(n - 1)Σ_n(X - μ)^2) / n

    Args:
        V_ips (float): Estimated value of the policy π using the Inverse Propensity Score estimator
        var_V_ips (float): Variance of V_ips
        n (int): Size of the logging policy dataset
        δ (float): Significance level for confidence coverage. If δ = 0.05, it's a 95% confidence interval

    Returns:
        Tuple: (lower bound of V(π), upper bound of V(π))

    """
    # t critical value with n - 1 degrees of freedom (population variance unknown)
    t_score = sp.stats.t.ppf(1 - kwargs["δ"] / 2, df=kwargs["n"] - 1)

    # Clip the interval to [0, 1] since 0 <= V(π) <= 1
    lb = min(1, max(0, kwargs["V_ips"] - t_score * (kwargs["var_V_ips"] ** 0.5)))
    ub = min(1, max(0, kwargs["V_ips"] + t_score * (kwargs["var_V_ips"] ** 0.5)))

    return min(lb, ub), max(lb, ub)
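As a quick sanity check, here’s a hypothetical call with made-up numbers: an IPS estimate of 0.12 with estimator variance 0.0004 over 10,000 logged samples.

# Hypothetical inputs, purely for illustration
student_t_bound(V_ips=0.12, var_V_ips=0.0004, n=10_000, δ=0.05)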

15.2.2 Asymptotically Gaussian (Central Limit Theorem)

def asymptotic_gaussian_bound(**kwargs):
    """Calculates the asymptotically Gaussian confidence interval for the V_ips estimator
    As n --> infinity, by the CLT, the sampling distribution of the sample mean of the
    importance-weighted rewards α_i approaches a Gaussian.
    If V_ips = 1/nΣ_nX = μ, then var[V_ips] = sample variance / n = (1/(n - 1)Σ_n(X - μ)^2) / n

    Args:
        V_ips (float): Estimated value of the policy π using the Inverse Propensity Score estimator
        var_V_ips (float): Variance of V_ips
        δ (float): Significance level for confidence coverage. If δ = 0.05, it's a 95% confidence interval

    Returns:
        Tuple: (lower bound of V(π), upper bound of V(π))

    """
    z_score = sp.stats.norm.ppf(1 - kwargs["δ"] / 2)

    # Clip the interval to [0, 1] since 0 <= V(π) <= 1
    lb = min(1, max(0, kwargs["V_ips"] - z_score * (kwargs["var_V_ips"] ** 0.5)))
    ub = min(1, max(0, kwargs["V_ips"] + z_score * (kwargs["var_V_ips"] ** 0.5)))

    return min(lb, ub), max(lb, ub)

15.2.3 Clopper-Pearson

def clopper_pearson_bound(**kwargs):
    """Calculates the Clopper-Pearson bound for the V_ips estimator
    Treats the rescaled importance-weighted rewards α_i / w_max ∈ [0, 1] as Bernoulli draws,
    so that k = V_ips * n / w_max plays the role of the number of successes out of n trials.
    0 <= V(π) <= 1
    https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper%E2%80%93Pearson_interval

    Args:
        V_ips (float): Estimated value of the policy π using the Inverse Propensity Score estimator
        w_max (float): Largest importance weight, max_i π(a_i|x_i) / μ(a_i|x_i)
        n (int): Size of the logging policy dataset
        δ (float): Significance level for confidence coverage. If δ = 0.05, it's a 95% confidence interval

    Returns:
        Tuple: (lower Clopper-Pearson bound of V(π), upper Clopper-Pearson bound of V(π))

    """
    # Equivalent number of "successes" after rescaling the rewards by w_max
    k = kwargs["V_ips"] * kwargs["n"] / kwargs["w_max"]

    # Beta-quantile form of the Clopper-Pearson interval
    lb = sp.special.betaincinv(k, kwargs["n"] - k + 1, kwargs["δ"] / 2) if k > 0 and kwargs["n"] - k + 1 > 0 else 0
    ub = sp.special.betaincinv(k + 1, kwargs["n"] - k, 1 - (kwargs["δ"] / 2)) if kwargs["n"] > k else 1

    # Undo the rescaling and clip to [0, 1] since 0 <= V(π) <= 1
    lb = min(1, max(0, kwargs["w_max"] * lb))
    ub = min(1, max(0, kwargs["w_max"] * ub))

    return min(lb, ub), max(lb, ub)

15.2.4 Bootstrapping
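Until I flesh this section out, here’s a minimal percentile-bootstrap sketch in the same style as the other bounds. Unlike the closed-form bounds, it needs the per-sample importance-weighted rewards \(\alpha_i\) (passed as a weighted_rewards array) rather than just summary statistics.

def bootstrap_bound(**kwargs):
    """Calculates a percentile-bootstrap confidence interval for the V_ips estimator

    Args:
        weighted_rewards (np.ndarray): Per-sample importance-weighted rewards α_i = (π(a_i|x_i) / μ(a_i|x_i)) * r_i
        n_bootstrap (int): Number of bootstrap resamples
        δ (float): Significance level for confidence coverage. If δ = 0.05, it's a 95% confidence interval

    Returns:
        Tuple: (lower bootstrap bound of V(π), upper bootstrap bound of V(π))

    """
    rng = np.random.default_rng(kwargs.get("seed"))
    α = kwargs["weighted_rewards"]

    # Resample α with replacement and recompute the IPS estimate on each resample
    boot_estimates = np.array(
        [rng.choice(α, size=len(α), replace=True).mean() for _ in range(kwargs["n_bootstrap"])]
    )

    # Percentile interval: the δ/2 and 1 - δ/2 empirical quantiles, clipped to [0, 1]
    lb = min(1, max(0, np.quantile(boot_estimates, kwargs["δ"] / 2)))
    ub = min(1, max(0, np.quantile(boot_estimates, 1 - kwargs["δ"] / 2)))

    return min(lb, ub), max(lb, ub)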

15.2.5 Concentration Inequality: Hoeffding Inequality

(2)\[\begin{align} { \left\vert \hat{V}_{\text{IPS}}{(\pi)} - {V}(\pi) \right\vert } &\leq \sqrt{\frac{\sum_{i=1}^{n}{(b_i - a_i)}^2}{2n^2} \ln{\frac{2}{\delta}}} \\ \end{align}\]

Hoeffding’s Inequality: Let \(X_1, ..., X_n\) be independent random variables strictly bounded by the interval \([a_i, b_i]\), \(a_i \leq X_i \leq b_i\). We define the empirical mean of these variables by \(\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i\), such that

(3)\[\begin{align} P(\left\vert \bar{X} - \mathbb{E}[\bar{X}] \right\vert \geq \epsilon) \leq 2{e}^{\Big(-\frac{2n^2\epsilon^2}{\sum_{i=1}^{n}{(b_i - a_i)}^2}\Big)} \\ \end{align}\]

From Hoeffding’s Inequality:

(4)\[\begin{align} P(\left\vert \bar{X} - \mathbb{E}[\bar{X}] \right\vert \geq \epsilon) &\leq 2{e}^{\Big(-\frac{2n^2\epsilon^2}{\sum_{i=1}^{n}{(b_i - a_i)}^2}\Big)} \\ 1 - P(\left\vert \bar{X} - \mathbb{E}[\bar{X}] \right\vert < \epsilon) &\leq 2{e}^{\Big(-\frac{2n^2\epsilon^2}{\sum_{i=1}^{n}{(b_i - a_i)}^2}\Big)} \\ \end{align}\]
(5)\[\begin{equation} \label{eq:1} P(\left\vert \bar{X} - \mathbb{E}[\bar{X}] \right\vert < \epsilon) \geq \underbrace{1 - 2{e}^{\Big(-\frac{2n^2\epsilon^2}{\sum_{i=1}^{n}{(b_i - a_i)}^2}\Big)}}_{1-\delta} \end{equation}\]

Finding \(\epsilon\):

(6)\[\begin{align} \therefore 1-\delta &= 1 - 2{e}^{\Big(-\frac{2n^2\epsilon^2}{\sum_{i=1}^{n}{(b_i - a_i)}^2}\Big)} \\ \delta &= 2{e}^{\Big(-\frac{2n^2\epsilon^2}{\sum_{i=1}^{n}{(b_i - a_i)}^2}\Big)} \\ \ln{\delta} &= \ln{2} - \frac{2n^2\epsilon^2}{\sum_{i=1}^{n}{(b_i - a_i)}^2} \\ \ln{\frac{2}{\delta}} &= \frac{2n^2\epsilon^2}{\sum_{i=1}^{n}{(b_i - a_i)}^2} \\ 2n^2\epsilon^2 &= \sum_{i=1}^{n}{(b_i - a_i)}^2 \ln{\frac{2}{\delta}} \\ \epsilon &= \sqrt{\frac{\sum_{i=1}^{n}{(b_i - a_i)}^2}{2n^2} \ln{\frac{2}{\delta}}} \\ \end{align}\]

Substituting \(\epsilon\) into the coverage statement in Equation (5) above, we get our Hoeffding Bound Confidence Intervals:

(7)\[\begin{align} P\Bigg(\left\vert \bar{X} - \mathbb{E}[\bar{X}] \right\vert < \sqrt{\frac{\sum_{i=1}^{n}{(b_i - a_i)}^2}{2n^2} \ln{\frac{2}{\delta}}}\Bigg) &\geq 1-\delta \\ P\Bigg(-\sqrt{\frac{\sum_{i=1}^{n}{(b_i - a_i)}^2}{2n^2} \ln{\frac{2}{\delta}}}< \bar{X} - \mathbb{E}[\bar{X}] < \sqrt{\frac{\sum_{i=1}^{n}{(b_i - a_i)}^2}{2n^2} \ln{\frac{2}{\delta}}}\Bigg) &\geq 1-\delta \\ P\Bigg(\bar{X}-\sqrt{\frac{\sum_{i=1}^{n}{(b_i - a_i)}^2}{2n^2} \ln{\frac{2}{\delta}}}< \mathbb{E}[\bar{X}] < \bar{X}+\sqrt{\frac{\sum_{i=1}^{n}{(b_i - a_i)}^2}{2n^2} \ln{\frac{2}{\delta}}}\Bigg) &\geq 1-\delta \end{align}\]

Assuming that the importance-weighted rewards \(\alpha_i\) are bounded between \([0, w_{max}]\), where \(w_{max}=\text{max}_{i\in n}\frac{\pi(a_i\vert x_i)}{\mu(a_i\vert x_i)}\) and rewards \(r_i \in [0, 1]\), we can substitute the following:

  • \(\mathbb{E}[\bar{X}]: V(\pi)\)

  • \(X_i: \alpha_i = \frac{\pi(a_i\vert x_i)}{\mu(a_i\vert x_i)}r_i\)

  • \(\bar{X}: \hat{V}_{\text{IPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n}\frac{\pi(a_i\vert x_i)}{\mu(a_i\vert x_i)}r_i\)

  • \(b_i: w_{max}\)

  • \(a_i: 0\)

(8)\[\begin{align} P\Bigg(\hat{V}_{\text{IPS}}(\pi)-\sqrt{\frac{\sum_{i=1}^{n}{(w_{max} - 0)}^2}{2n^2} \ln{\frac{2}{\delta}}} < V(\pi) < \hat{V}_{\text{IPS}}(\pi)+\sqrt{\frac{\sum_{i=1}^{n}{(w_{max} - 0)}^2}{2n^2} \ln{\frac{2}{\delta}}}\Bigg) &\geq 1-\delta \\ P\Bigg(\hat{V}_{\text{IPS}}(\pi)-\sqrt{\frac{nw^2_{max}}{2n^2} \ln{\frac{2}{\delta}}} < V(\pi) < \hat{V}_{\text{IPS}}(\pi)+\sqrt{\frac{nw^2_{max}}{2n^2} \ln{\frac{2}{\delta}}}\Bigg) &\geq 1-\delta \\ P\Bigg(\hat{V}_{\text{IPS}}(\pi)-w_{max}\sqrt{\frac{1}{2n} \ln{\frac{2}{\delta}}} < V(\pi) < \hat{V}_{\text{IPS}}(\pi)+w_{max}\sqrt{\frac{1}{2n} \ln{\frac{2}{\delta}}}\Bigg) &\geq 1-\delta \end{align}\]
def hoeffding_bound(**kwargs):
    """Calculates the empirical Hoeffding bound for the V_ips estimator
    using the Hoeffding Inequality

    Args:
        V_ips (float): Estimated value of the policy π using the Inverse Propensity Score estimator
        w_max (float): Largest importance weight, max_i π(a_i|x_i) / μ(a_i|x_i)
        n (int): Size of the logging policy dataset
        δ (float): Significance level for confidence coverage. If δ = 0.05, it's a 95% confidence interval

    Returns:
        Tuple: (lower empirical Hoeffding bound of V(π), upper empirical Hoeffding bound of V(π))

    """
    # ε = w_max * sqrt(ln(2/δ) / (2n)), from the derivation above
    ε = kwargs["w_max"] * np.sqrt((1 / (2 * kwargs["n"])) * np.log(2 / kwargs["δ"]))

    # Clip the interval to [0, 1] since 0 <= V(π) <= 1
    lb = min(1, max(0, kwargs["V_ips"] - ε))
    ub = min(1, max(0, kwargs["V_ips"] + ε))
    return lb, ub

15.2.6 Concentration Inequality: Bernstein Inequality

(9)\[\begin{align} { \left\vert \hat{V}_{\text{IPS}}{(\pi)} - {V}(\pi) \right\vert } &\leq { \sqrt{ {2\text{log}\frac{2}{\delta}}{\frac{{\text{Var}}_{(x, a, r) \sim \mu}[\hat{V}_{\text{IPS}}{(\pi)}]}{n}}} + \frac{2{\hat{w}_{max}}}{3n}{\text{log} \frac{2}{\delta}} } \\ \end{align}\]

Bernstein Inequality: Suppose \(X_1, \cdots, X_n\) are \(i.i.d.\) with 0 mean, variance \(\sigma^2\) and \(\vert X_i \vert \leq M\) almost surely,

(10)\[\begin{align} P\Bigg({ \left\vert \frac{1}{n}\sum^{n}_{i=1} X_i \right\vert } &\leq { \sqrt{\frac{2\sigma^2}{n}\log\frac{2}{\delta}} + \frac{2M}{3n}{\log\frac{2}{\delta}} }\Bigg) \geq 1 - \delta \end{align}\]

Since \(\hat{V}_{\text{IPS}}(\pi)\) is an unbiased estimator of \(V(\pi)\), \(\hat{V}_{\text{IPS}}{(\pi)} - {V}(\pi)\) has mean 0 and variance \(\text{Var}[\hat{V}_{\text{IPS}}(\pi)]\); setting \(M = w_{max}\):

(11)\[\begin{align} P\Bigg({ \left\vert \hat{V}_{\text{IPS}}{(\pi)} - {V}(\pi) \right\vert } &\leq { \sqrt{ {2\text{log}\frac{2}{\delta}}{\frac{{\text{Var}}_{(x, a, r) \sim \mu}[\hat{V}_{\text{IPS}}{(\pi)}]}{n}}} + \frac{2{\hat{w}_{max}}}{3n}{\text{log} \frac{2}{\delta}} }\Bigg) \geq 1-\delta \end{align}\]
(12)\[\begin{align} P\Bigg(\hat{V}_{\text{IPS}}{(\pi)} - \Bigg({ \sqrt{ {2\log\frac{2}{\delta}}{\frac{{\text{Var}}_{(x, a, r) \sim \mu}[\hat{V}_{\text{IPS}}{(\pi)}]}{n}}} + \frac{2{\hat{w}_{max}}}{3n}{\log \frac{2}{\delta}} }\Bigg) \leq {V}(\pi) \leq \hat{V}_{\text{IPS}}{(\pi)} + \Bigg({ \sqrt{ {2\log\frac{2}{\delta}}{\frac{{\text{Var}}_{(x, a, r) \sim \mu}[\hat{V}_{\text{IPS}}{(\pi)}]}{n}}} + \frac{2{\hat{w}_{max}}}{3n}{\log \frac{2}{\delta}} }\Bigg)\Bigg) &\geq 1 - \delta \\ \end{align}\]
def bernstein_bound(**kwargs):
    """Calculates the empirical Bernstein bound for the V_ips estimator
    using the Bernstein Inequality

    Args:
        V_ips (float): Estimated value of the policy π using the Inverse Propensity Score estimator
        var_V_ips (float): Variance of V_ips
        w_max (float): Largest importance weight, max_i π(a_i|x_i) / μ(a_i|x_i)
        n (int): Size of the logging policy dataset
        δ (float): Significance level for confidence coverage. If δ = 0.05, it's a 95% confidence interval

    Returns:
        Tuple: (lower empirical Bernstein bound of V(π), upper empirical Bernstein bound of V(π))

    """
    # var_V_ips is already the variance of the estimator (per-sample variance / n),
    # so the first term is sqrt(2 * ln(2/δ) * var_V_ips) with no extra 1/n factor
    ε = np.sqrt(2 * np.log(2 / kwargs["δ"]) * kwargs["var_V_ips"]) + (
        2 * kwargs["w_max"] * np.log(2 / kwargs["δ"])
    ) / (3 * kwargs["n"])

    # Clip the interval to [0, 1] since 0 <= V(π) <= 1
    lb = min(1, max(0, kwargs["V_ips"] - ε))
    ub = min(1, max(0, kwargs["V_ips"] + ε))
    return lb, ub
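To compare the closed-form bounds side by side, here’s a hypothetical run over a shared set of made-up summary statistics (bootstrap_bound is omitted since it needs the per-sample rewards). At this sample size, the concentration-inequality bounds come out noticeably wider than the distributional ones.

# Hypothetical inputs, purely for illustration: an IPS estimate of 0.12 with
# estimator variance 0.0004, a maximum importance weight of 5, and n = 10,000
params = dict(V_ips=0.12, var_V_ips=0.0004, w_max=5.0, n=10_000, δ=0.05)

for bound in [
    student_t_bound,
    asymptotic_gaussian_bound,
    clopper_pearson_bound,
    hoeffding_bound,
    bernstein_bound,
]:
    print(f"{bound.__name__}: {bound(**params)}")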

15.3. Online Learning: Context-free

15.3.1 Random

15.3.2 Epsilon Greedy

15.3.3 Bernoulli Thompson Sampling
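Until I write these sections up properly, here’s a single simulation sketch covering all three context-free policies, assuming OBP’s obp.policy API (Random, EpsilonGreedy, and BernoulliTS, each exposing select_action / update_params); the per-arm reward probabilities below are made up.

from obp.policy import BernoulliTS, EpsilonGreedy, Random

n_actions = 3
context_free_policies = {
    "random": Random(n_actions=n_actions),
    "epsilon_greedy": EpsilonGreedy(n_actions=n_actions, epsilon=0.1),
    "bernoulli_ts": BernoulliTS(n_actions=n_actions),
}

rng = np.random.default_rng(0)
true_ctr = np.array([0.1, 0.2, 0.3])  # hypothetical per-arm reward probabilities

for name, policy in context_free_policies.items():
    for _ in range(1_000):
        # select_action returns the chosen action(s) as an array of length len_list
        action = policy.select_action()[0]
        reward = rng.binomial(n=1, p=true_ctr[action])
        policy.update_params(action=action, reward=reward)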


15.4. Online Learning: Contextual (Linear)

15.4.1 Linear Epsilon Greedy

15.4.2 Linear Thompson Sampling

15.4.3 Linear Upper Confidence Bound
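Again, a single sketch for now covering the three linear contextual policies, assuming OBP’s obp.policy API (LinEpsilonGreedy, LinTS, and LinUCB, each taking a context of shape (1, dim) in select_action and update_params); the reward model below is made up.

from obp.policy import LinEpsilonGreedy, LinTS, LinUCB

dim, n_actions = 5, 3
linear_policies = {
    "lin_epsilon_greedy": LinEpsilonGreedy(dim=dim, n_actions=n_actions, epsilon=0.1),
    "lin_ts": LinTS(dim=dim, n_actions=n_actions),
    "lin_ucb": LinUCB(dim=dim, n_actions=n_actions, epsilon=0.1),
}

rng = np.random.default_rng(0)
theta = rng.normal(size=(n_actions, dim))  # hypothetical per-arm reward weights

for name, policy in linear_policies.items():
    for _ in range(1_000):
        context = rng.normal(size=(1, dim))
        action = policy.select_action(context)[0]
        # Hypothetical reward model: Bernoulli(sigmoid(θ_a · x))
        p = 1 / (1 + np.exp(-context[0] @ theta[action]))
        policy.update_params(action=action, reward=rng.binomial(n=1, p=p), context=context)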


15.5. Online Learning: Contextual (Logistic)

15.5.1 Logistic Epsilon Greedy

15.5.2 Logistic Thompson Sampling

15.5.3 Logistic Upper Confidence Bound
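Same pattern as the linear case, assuming OBP’s logistic policy classes (LogisticEpsilonGreedy, LogisticTS, and LogisticUCB), which model the reward with an online logistic regression per arm; the interaction loop mirrors the linear sketch above.

from obp.policy import LogisticEpsilonGreedy, LogisticTS, LogisticUCB

dim, n_actions = 5, 3
logistic_policies = {
    "logistic_epsilon_greedy": LogisticEpsilonGreedy(dim=dim, n_actions=n_actions, epsilon=0.1),
    "logistic_ts": LogisticTS(dim=dim, n_actions=n_actions),
    "logistic_ucb": LogisticUCB(dim=dim, n_actions=n_actions, epsilon=0.1),
}

rng = np.random.default_rng(0)
theta = rng.normal(size=(n_actions, dim))  # hypothetical per-arm reward weights

for name, policy in logistic_policies.items():
    for _ in range(1_000):
        context = rng.normal(size=(1, dim))
        action = policy.select_action(context)[0]
        # Same hypothetical reward model as above: Bernoulli(sigmoid(θ_a · x))
        p = 1 / (1 + np.exp(-context[0] @ theta[action]))
        policy.update_params(action=action, reward=rng.binomial(n=1, p=p), context=context)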


15.6. Offline (Off-Policy) Learning Algorithms

15.6.1 Inverse Probability Weighting (IPW) Learner
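Here’s a minimal end-to-end sketch, assuming OBP’s SyntheticBanditDataset and IPWLearner APIs. IPWLearner reduces off-policy learning to importance-weighted classification: it fits a classifier from contexts to actions, weighting each logged sample by \(r_i / \mu(a_i \vert x_i)\).

from sklearn.linear_model import LogisticRegression

from obp.dataset import SyntheticBanditDataset
from obp.policy import IPWLearner

# Hypothetical synthetic logged bandit feedback (all parameters made up)
dataset = SyntheticBanditDataset(n_actions=3, dim_context=5, reward_type="binary", random_state=0)
bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=10_000)

# Fit the IPW learner: a classifier over actions, importance-weighted by r / pscore
ipw_learner = IPWLearner(n_actions=dataset.n_actions, base_classifier=LogisticRegression())
ipw_learner.fit(
    context=bandit_feedback["context"],
    action=bandit_feedback["action"],
    reward=bandit_feedback["reward"],
    pscore=bandit_feedback["pscore"],
)

# Action choices for new contexts, shape (n_rounds, n_actions, len_list)
action_dist = ipw_learner.predict(context=bandit_feedback["context"])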