This notebook accompanies the paper, Bayesian Nonparametric Customer Base Analysis with Model-based Visualizations, forthcoming at Marketing Science. In it, we illustrate the code for estimating the GPPM and analyzing the results. The code needed to run the appendix simulations and the benchmark models is included separately in our replication files. Note that, due to a non-disclosure agreement, we cannot use the true data for this illustration. Thus, we use simulated data that retains the general features of the true data.

We emphasize that all data used herein are synthetic, created by the authors via probabilistic simulation (random number generation) to exhibit aggregate patterns similar to those in the data used in the paper.

Prerequisites:

This notebook relies on several packages:

require(rstan)
require(ggplot2)
require(mvtnorm)
require(doParallel)
require(foreach)
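
If any of these packages are not already installed, they can first be obtained from CRAN, for example:

install.packages(c("rstan","ggplot2","mvtnorm","doParallel","foreach"))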

Data summary:

The data take the form of a decomposed purchasing (spend incidence) event log, with columns indicating the calendar day (t), the time since first spend (l, for lifetime), the time since the most recent spend (r, for recency), the purchase number (pnum), and whether or not a spend actually occurred (y) at that combination of inputs, for the person with a given ID (id). There is also an additional column indicating the day on which this person first spent (fs). For this illustration, we use simulated data similar to the life simulator (LS) game:

data <- read.csv("ls_sim.csv")
# data <- read.csv("cb_sim.csv")
head(data)
##   y id fs  t r l pnum
## 1 0  1 20 21 1 1    2
## 2 0  1 20 22 2 2    2
## 3 0  1 20 23 3 3    2
## 4 0  1 20 24 4 4    2
## 5 0  1 20 25 5 5    2
## 6 0  1 20 26 6 6    2

From this, we can visualize the time series of spending, which we will fit later:

plot(as.vector(table(data[data[,"y"]==1,"t"])), 
     type="l", ylab="Spends", xlab="Time", main="Spending over time (LS)")

Estimation: Calling Stan

From the data, we create a list that can be used as input to Stan:

data_list <- list(M=nrow(data),
                  N=length(unique(data[,"id"])),
                  P=max(data[,"t"])-min(data[,"t"])+1,
                  K=max(data[,"pnum"])-min(data[,"pnum"])+1,
                  id=data[,"id"],
                  t=data[,"t"]-min(data[,"t"])+1,
                  l=data[,"l"],
                  r=data[,"r"],
                  pnum=data[,"pnum"]-min(data[,"pnum"])+1,
                  y=data[,"y"])

We normalize several of these to ensure they start at 1. The quantities in this list correspond directly to the quantities in the data block of the gppm.stan file, which is also included in the replication documents.
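
Before calling Stan, a quick sanity check of this list can catch indexing mistakes; a minimal sketch (not part of the replication code):

str(data_list)   ## M, N, P, K are scalars; id, t, l, r, pnum, y all have length M
stopifnot(length(data_list$y) == data_list$M,
          max(data_list$t) == data_list$P,
          max(data_list$pnum) == data_list$K,
          all(data_list$y %in% c(0,1)))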

The code below calls the Stan sampler. NOTE: This may take a long time to run. We set the number of iterations here to be much smaller than those used in the paper, for illustration purposes. Note that the pars argument together with include=FALSE simply tells Stan to drop the listed intermediate quantities before saving the output; these variables are only used within the Stan program for normalization purposes:

fit <- stan(file="gppm.stan",
            data=data_list,
            warmup=200,  ## may need to be increased for proper convergence
            iter=400,    ## may need to be increased for proper convergence
            chains=1,
            pars=c("free_week","free_short","free_rec",
                   "free_life","free_pnum","week_mean",
                   "short_mean","long_mean","rec_mean",
                   "life_mean","pnum_mean","z"),
            include=FALSE,
            seed=7926)

fit_list <- extract(fit)

Before sampling begins, several informational messages will appear; these arise from how we fix certain elements to zero for identification, as described in the paper, and they can be ignored. We suppress them in this notebook.
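
If the model was run above (rather than loading the saved draws below), it is also worth a quick look at standard MCMC diagnostics before using the posterior samples. A minimal sketch using built-in rstan utilities; the hyperparameter names are the ones referenced later in this notebook:

print(fit, pars=c("etasq_long","etasq_short","etasq_week",
                  "etasq_rec","etasq_life","etasq_pnum"))   ## check Rhat and n_eff
traceplot(fit, pars=c("etasq_long","etasq_short"))          ## visual check of mixing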

Once the Stan program finishes running, we extract the posterior samples with extract(), which we can use to create the model-based dashboards and tracking plots of interest. Because the Stan estimation can take a long time, we also include samples from a long run of the sampler, which can be loaded in lieu of running the model above. This loads the already computed fit_list object:

load("ls_fit.RData")
# load("cb_fit.RData")

Model-based Dashboard

The main output of interest of the GPPM is the model-based dashboard, which is simply a visualization of the latent function draws. It can be produced easily by plotting the posterior function draws:

dashdf <- with(fit_list, { rbind(data.frame(t=1:ncol(alpha_long),
                                 plot="Calendar, Long-run", 
                                 med=apply(alpha_long,2,median),
                                 lower=apply(alpha_long,2,quantile,probs=0.025),
                                 upper=apply(alpha_long,2,quantile,probs=0.975)),
                      data.frame(t=1:ncol(alpha_short),
                                 plot="Calendar, Short-run",
                                 med=apply(alpha_short,2,median),
                                 lower=apply(alpha_short,2,quantile,probs=0.025),
                                 upper=apply(alpha_short,2,quantile,probs=0.975)),
                      data.frame(t=1:ncol(alpha_week),
                                 plot="Calendar, Weekly",
                                 med=apply(alpha_week,2,median),
                                 lower=apply(alpha_week,2,quantile,probs=0.025),
                                 upper=apply(alpha_week,2,quantile,probs=0.975)),
                      data.frame(t=1:ncol(alpha_rec),
                                 plot="Recency",
                                 med=apply(alpha_rec,2,median),
                                 lower=apply(alpha_rec,2,quantile,probs=0.025),
                                 upper=apply(alpha_rec,2,quantile,probs=0.975)),
                      data.frame(t=1:ncol(alpha_life),
                                 plot="Lifetime",
                                 med=apply(alpha_life,2,median),
                                 lower=apply(alpha_life,2,quantile,probs=0.025),
                                 upper=apply(alpha_life,2,quantile,probs=0.975)),
                      data.frame(t=1:ncol(alpha_pnum),
                                 plot="Purchase Number",
                                 med=apply(alpha_pnum,2,median),
                                 lower=apply(alpha_pnum,2,quantile,probs=0.025),
                                 upper=apply(alpha_pnum,2,quantile,probs=0.975))) })

colvec <- c(rep("blue2",3),rep("red2",3))
ggplot()+
  geom_ribbon(data=dashdf,aes(x=t,ymin=lower,ymax=upper,fill=plot),alpha=0.15,color=NA)+
  geom_line(data=dashdf,aes(x=t,y=med,color=plot))+
  scale_color_manual(values=colvec)+
  scale_fill_manual(values=colvec)+
  facet_wrap(~plot,ncol=3,scales="free")+
  ylab("Function Value")+
  xlab("Input")+
  theme_bw()+
  theme(legend.position="none")

We see the results in this simulated data are basically the same as in the paper, although there seems to be less separation between the short and long-run calendar time effects. Running the sampler for more iterations, or with slightly more informative priors, may alleviate that.

Fit and Fit Decomposition

To see how the GPPM fits the data, we simulate spending according to the estimated propensities. To speed up this process, we parallelize these simulations:

cl <- makeCluster(detectCores()-1)
registerDoParallel(cl)

logit <- function(x){1/(1+exp(-x))}   ## inverse logit (logistic): maps a log-odds propensity to a probability

N <- length(unique(data[,"id"]))
tmin <- min(data[,"t"])
P <- ncol(fit_list$alpha_long)
niter <- nrow(fit_list$alpha_long)

## This can be changed to only use a random subset of the total HMC iterations for simulation:
nrep <- nrow(fit_list$alpha_long)

## Extract the posterior curves:
attach(fit_list)
  iters <- sample(1:niter, nrep, replace=ifelse(nrep>niter,T,F))
  alpha_long_rep <- alpha_long[iters,]
  alpha_short_rep <- alpha_short[iters,]
  alpha_week_rep <- alpha_week[iters,]
  alpha_rec_rep <- alpha_rec[iters,]
  alpha_life_rep <- alpha_life[iters,]
  alpha_pnum_rep <- alpha_pnum[iters,]
  delta <- delta[iters,]
detach(fit_list)

max_pnum <- ncol(alpha_pnum_rep)

rep_data <- foreach(i=1:N) %dopar% {
  tmin_i <- min(data[data[,"id"]==i,"t"])-tmin+1   
  delta_i <- delta[,i]
  
  irepdata <- array(NA,c(P,2,nrep))
  
  ipt <- rep(1,nrep)    ## recency index for each replicate (resets to 1 after a simulated spend)
  lt <- 1               ## lifetime index (time since first spend), the same across replicates
  pnum <- rep(1,nrep)   ## purchase number index for each replicate
  
  irepdata[1:(tmin_i-1),1,] <- 1:(tmin_i-1)
  irepdata[1:(tmin_i-1),2,] <- 0
  
  for(t in tmin_i:P){
    ipind <- cbind(1:length(ipt),ipt)
    pnumind <- cbind(1:length(pnum),ifelse(pnum>max_pnum,max_pnum,pnum))
    irepdata[t,,] <- rbind(rep(t,nrep),rbinom(nrep,1,logit(alpha_long_rep[,t]+alpha_short_rep[,t]+
                                                             alpha_week_rep[,t]+
                                                             alpha_rec_rep[ipind]+
                                                             alpha_life_rep[,lt]+
                                                             alpha_pnum_rep[pnumind]+
                                                             delta_i)))
    
    ipt <- ipt*(1-irepdata[t,2,])+1
    pnum <- pnum+irepdata[t,2,]
    lt <- lt+1
  }
  
  irepdata
}

## These just compile the simulation results:
combined <- do.call(rbind,lapply(rep_data, function(r){apply(r,3,function(x){x[,1]*x[,2]})}))
sim_sbd <- apply(combined, 2, function(x){table(factor(x[x>0], levels=1:P))})

# This computes the true spends-by-day in the data:
true_sbd <- as.vector(by(data[,"y"], data[,"t"], sum))

# This combines all of these to a data.frame that can be used by ggplot:
fitdf <- data.frame(t=rep(1:length(true_sbd),2),
                    points=c(true_sbd, apply(sim_sbd,1,median)),
                    curve=as.factor(c(rep(1,length(true_sbd)), rep(2,length(true_sbd)))),
                    lower=c(true_sbd, apply(sim_sbd,1,quantile,0.025)),
                    upper=c(true_sbd, apply(sim_sbd,1,quantile,0.975)))

Because these simulations can also take a while, especially without parallel processing, we again include an intermediate output file that contains the fitdf object:

# load("ls_fitplot_fitdf.RData")
# load("cb_fitplot_fitdf.RData")

With this, we can plot the fit:

ggplot(fitdf,aes(x=t,y=points,color=curve,linetype=curve))+
  geom_line()+
  geom_ribbon(aes(ymin=lower,ymax=upper),alpha=0.1,color=NA,fill="red")+
  ylab("Spends")+
  xlab("Calendar Time")+
  scale_color_manual(values=c("black","red"),labels=c("Data","GPP"),guide=guide_legend(title=NULL))+
  ggtitle("Aggregate Fit")+
  scale_linetype_manual(values=c("solid","longdash"),labels=c("Data","GPP"),guide=guide_legend(title=NULL))+
  theme_bw()+
  theme(legend.position="none")

To get any of the muted fits, as described in Section 3.1 of the paper, we can modify the above simulation. For example, to mute the short-run calendar time effect, we just modify the “Extract posterior curves” section of the code, to:

attach(fit_list)
  iters <- sample(1:niter, nrep, replace=ifelse(nrep>niter,T,F))
  alpha_long_rep <- alpha_long[iters,]
  alpha_short_rep <- t(apply(alpha_short[iters,], 1, function(x) rep(mean(x), P)))
  alpha_week_rep <- alpha_week[iters,]
  alpha_rec_rep <- alpha_rec[iters,]
  alpha_life_rep <- alpha_life[iters,]
  alpha_pnum_rep <- alpha_pnum[iters,]
  delta <- delta[iters,]
detach(fit_list)

This mutes the short-run component by fixing it to its mean over time (the apply call does this separately for each HMC iteration). The rest of the simulation can then be run as is to produce the muted fit.
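
The same idea works for any of the other components. For instance, to mute the weekly calendar effect instead, the corresponding line in the block above would become:

  alpha_week_rep <- t(apply(alpha_week[iters,], 1, function(x) rep(mean(x), P)))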

Forecasting

To test the forecasting capabilities of the GPPM, we need to re-estimate the model, leaving off the last 30 days of spend data. We do not repeat the code for that here. The posterior estimates, without the last 30 days of data, are provided in separate files, which we load:

isdata <- data[data[,"t"] < (P-29),]   ## in-sample event log: the data excluding the last 30 days
load("ls_fit_fore.RData")
# load("cb_fit_fore.RData")

To do the forecasting, we first need to draw from the posterior predictive for the latent GP curves:

logit <- function(x){1/(1+exp(-x))}

## Squared-exponential (SE) kernel, with a small nugget (0.001) added on the diagonal:
se_cov_func <- function(x1,x2,etasq,rhosq){ ifelse(x1==x2,etasq+0.001,etasq*exp((-(x1-x2)^2)/rhosq)) }
## Periodic kernel with period c (default 7, for the weekly effect), with the same nugget:
cyc_cov_func <- function(x1,x2,etasq,rhosq,c=7){ ifelse(x1==x2,etasq+0.001,
                                                        etasq*exp(-2*(sin(pi*(x1-x2)/c)^2)/rhosq))}

make.K.SE <- function(P,etasq,rhosq){
  K <- matrix(NA,P,P)
  for(i in 1:P){
    for(j in i:P){
      K[i,j] <- se_cov_func(i,j,etasq,rhosq)
      K[j,i] <- K[i,j]
    } }
  K
}
make.K.Per <- function(P,etasq,rhosq,c=7){
  K <- matrix(NA,P,P)
  for(i in 1:P){
    for(j in i:P){
      K[i,j] <- cyc_cov_func(i,j,etasq,rhosq,c)
      K[j,i] <- K[i,j]
    } }
  K
}
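
For intuition, here is a small SE covariance matrix built with hypothetical hyperparameter values (chosen only for illustration):

round(make.K.SE(5, etasq=1, rhosq=10), 3)   ## 5 x 5 covariance; entries decay smoothly in |i-j|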

# Draw from the GP predictive distribution, given:
#   func:  a posterior draw of the latent function over the observed inputs
#   cmf:   a constant mean (used when nlmf is NA)
#   nlmf:  parameters (a,b) of the nonlinear mean function a*(x-1)^b (used when supplied)
#   covf:  covariance matrix over both observed and forecast inputs
#   tfore: number of forecast periods to draw
gpfore <- function(func, cmf=0, nlmf=NA, covf, tfore){
  require(mvtnorm)
  
  tmax <- length(func)
  x <- 1:tmax
  xstar <- (tmax+1):(tmax+tfore)
  
  if(is.na(nlmf[1])){
    pred_mean <- rep(cmf,tfore) + covf[xstar,x] %*% chol2inv(chol(covf[x,x])) %*% (func-rep(cmf,tmax))
    pred_cov <- covf[xstar,xstar] - covf[xstar,x] %*% chol2inv(chol(covf[x,x])) %*% t(covf[xstar,x])
    pred_cov <- (pred_cov + t(pred_cov))/2 + diag(rep(0.0001,tfore))
  } else {
    mf_in <- nlmf[1]*(x-1)^nlmf[2]
    mf_fore <- nlmf[1]*(xstar-1)^nlmf[2]
    
    pred_mean <- mf_fore + covf[xstar,x] %*% chol2inv(chol(covf[x,x])) %*% (func - mf_in)
    pred_cov <- covf[xstar,xstar] - covf[xstar,x] %*% chol2inv(chol(covf[x,x])) %*% t(covf[xstar,x])
    pred_cov <- (pred_cov + t(pred_cov))/2 + diag(rep(0.0001,tfore))
  }
  c(func,rmvnorm(1,pred_mean,pred_cov,method="svd"))
}
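
As a standalone illustration of gpfore, using hypothetical hyperparameter values rather than posterior draws:

## Extend a single posterior draw of the long-run curve by 30 periods:
tmax_ex <- ncol(fit_list$alpha_long)
K_ex <- make.K.SE(tmax_ex + 30, etasq=0.5, rhosq=100)   ## hypothetical values, for illustration only
ext_ex <- gpfore(fit_list$alpha_long[1,], cmf=0, covf=K_ex, tfore=30)
length(ext_ex)   ## tmax_ex + 30: the original draw followed by 30 predictive values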

# Extracting some in-sample information 
tmin <- min(isdata[,"t"])
is_tmax <- P-30
tfore <- 30

tmin_vec <- rep(NA,N)
for(i in 1:N){ tmin_vec[i] <- min(isdata[isdata[,"id"]==i,"t"])-tmin+1 }


# Drawing from the posterior predictive for the latent GPs:
fore_fl <- fit_list

attach(fit_list)
niter <- nrow(alpha_long)
tmax <- ncol(alpha_long)
P <- tmax+tfore

fore_fl$alpha_long <- matrix(NA,niter,P)
fore_fl$alpha_short <- matrix(NA,niter,P)
fore_fl$alpha_rec <- matrix(NA,niter,P)
fore_fl$alpha_week <- matrix(NA,niter,P)
fore_fl$alpha_life <- matrix(NA,niter,P)
fore_fl$alpha_pnum <- matrix(NA,niter,100)

for(k in 1:niter){
  cal1_cov <- make.K.SE(P,etasq_long[k],rhosq_cal[k,2])
  cal2_cov <- make.K.SE(P,etasq_short[k],rhosq_cal[k,1])
  life_cov <- make.K.SE(P,etasq_life[k],rhosq_life[k])
  rec_cov <- make.K.SE(P,etasq_rec[k],rhosq_rec[k])
  week_cov <- make.K.Per(P,etasq_week[k],rhosq_week[k])
  pnum_cov <- make.K.SE(100,etasq_pnum[k],rhosq_pnum[k])
  
  fore_fl$alpha_long[k,] <- gpfore(alpha_long[k,], cmf=mu[k], covf=cal1_cov, tfore=tfore)
  fore_fl$alpha_short[k,] <- gpfore(alpha_short[k,], covf=cal2_cov, tfore=tfore)
  fore_fl$alpha_life[k,] <- gpfore(alpha_life[k,], nlmf=c(lambda_life1[k],lambda_life2[k]), covf=life_cov, tfore=tfore)
  fore_fl$alpha_rec[k,] <- gpfore(alpha_rec[k,], nlmf=c(lambda_rec1[k],lambda_rec2[k]), covf=rec_cov, tfore=tfore)
  fore_fl$alpha_week[k,] <- gpfore(alpha_week[k,], covf=week_cov, tfore=tfore)
  fore_fl$alpha_pnum[k,] <- gpfore(alpha_pnum[k,], nlmf=c(lambda_pnum1[k],lambda_pnum2[k]), covf=pnum_cov, tfore=(100-ncol(fit_list$alpha_pnum)))
}
detach(fit_list)

Once we have these predictive draws of the latent functions, stored in the fit list fore_fl, we can repeat the fit simulations from the previous section:

nrep <- niter
attach(fore_fl)
  iters <- sample(1:niter,nrep,replace=ifelse(nrep>niter,T,F))
  alpha_long_rep <- alpha_long[iters,]
  alpha_short_rep <- alpha_short[iters,]
  alpha_week_rep <- alpha_week[iters,]
  alpha_rec_rep <- alpha_rec[iters,]
  alpha_life_rep <- alpha_life[iters,]
  alpha_pnum_rep <- alpha_pnum[iters,]
  delta <- fit_list$delta[iters,]
detach(fore_fl)

rep_data <- foreach(i=1:N) %dopar% {
  tmin_i <- tmin_vec[i]
  delta_i <- delta[,i]
  
  irepdata <- array(NA,c(P,2,nrep))
  
  ipt <- rep(1,nrep)
  lt <- 1
  pnum <- rep(1,nrep)
  
  irepdata[1:(tmin_i-1),1,] <- 1:(tmin_i-1)
  irepdata[1:(tmin_i-1),2,] <- 0
  for(t in tmin_i:P){
    ipind <- cbind(1:length(ipt),ipt)
    pnumind <- cbind(1:length(pnum),ifelse(pnum>ncol(alpha_pnum_rep),ncol(alpha_pnum_rep),pnum))  ## cap at the last column, as in the in-sample simulation
    irepdata[t,,] <- rbind(rep(t,nrep),rbinom(nrep,1,logit(alpha_long_rep[,t]+alpha_short_rep[,t]+
                                                                 alpha_week_rep[,t]+
                                                                 alpha_rec_rep[ipind]+
                                                                 alpha_life_rep[,lt]+
                                                                 alpha_pnum_rep[pnumind]+
                                                                 delta_i)))
    
    ipt <- ipt*(1-irepdata[t,2,])+1
    pnum <- pnum+irepdata[t,2,]
    lt <- lt+1
  }
  irepdata
}

combined <- do.call(rbind,lapply(rep_data, function(r){apply(r,3,function(x){x[,1]*x[,2]})}))
pred_sbd <- apply(combined,2,function(x){table(factor(x[x>0],levels=1:P))})

fitdf <- data.frame(t=rep(1:length(true_sbd),2),
                    points=c(true_sbd,apply(pred_sbd,1,median)),
                    curve=as.factor(c(rep(1,length(true_sbd)),rep(2,length(true_sbd)))),
                    lower=c(true_sbd,apply(pred_sbd,1,quantile,0.05)),
                    upper=c(true_sbd,apply(pred_sbd,1,quantile,0.95)))

As before, these simulations can take a while, so we have also saved the intermediate fitdf object, which can be loaded in lieu of running the simulation above. We then plot the forecast using ggplot2:

# load("ls_foreplot_fitdf.RData")
# load("cb_foreplot_fitdf.RData")

ggplot(fitdf,aes(x=t,y=points,color=curve,linetype=curve))+
  geom_line()+
  geom_ribbon(aes(ymin=lower,ymax=upper),alpha=0.1,color=NA,fill="red")+
  ylab("Spends")+
  xlab("Calendar Time")+
  scale_color_manual(values=c("black","red"),labels=c("Data","GPP"),
                     guide=guide_legend(title=NULL))+
  ggtitle("Forecast")+
  scale_linetype_manual(values=c("solid","longdash"),labels=c("Data","GPP"),
                        guide=guide_legend(title=NULL))+
  geom_vline(xintercept=max(fitdf[,"t"])-30, linetype="dotted") +
  theme_bw()+
  theme(legend.position="none")
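
Beyond the visual comparison, forecast accuracy over the 30 held-out days can also be summarized numerically; a minimal sketch computing the mean absolute error of the posterior median forecast (an illustration, not a metric reported in the paper):

holdout <- (max(fitdf[,"t"])-29):max(fitdf[,"t"])
obs <- fitdf[fitdf[,"curve"]=="1" & fitdf[,"t"] %in% holdout, "points"]    ## observed spends
pred <- fitdf[fitdf[,"curve"]=="2" & fitdf[,"t"] %in% holdout, "points"]   ## GPPM median forecast
mean(abs(obs - pred))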

Aside: Forecasting Trick

In the code above, we coded the predictive distribution of the GP draws manually in R; this is how we produced all of the results in the paper. In practice, however, Stan can be used to do this more efficiently. In particular, when specifying the initial data list, if the number of periods P and the maximum purchase number K are made larger than what is observed in the data, then Stan will automatically draw from the predictive distribution of the GPs for these held-out periods. Specifically, if the data list is modified as follows:

data_list <- list(M=nrow(data),
                  N=length(unique(data[,"id"])),
                  P=max(data[,"t"])-min(data[,"t"])+1+50,  ## additional +50 for forecasting 50 periods
                  K=max(data[,"pnum"])-min(data[,"pnum"])+1+50,  ## additional +50 for forecasting 50 periods
                  id=data[,"id"],
                  t=data[,"t"]-min(data[,"t"])+1,
                  l=data[,"l"],
                  r=data[,"r"],
                  pnum=data[,"pnum"]-min(data[,"pnum"])+1,
                  y=data[,"y"])

Then, in the extracted fit object, the alpha_xxxx objects (e.g., alpha_long for the long-run calendar component) will have dimension niter x (P+50), where P is the number of observed periods. The first P columns will be the same as before, and the last 50 columns will contain the predictive draws of the GP for periods P+1 through P+50.
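
For example, if the model is re-run with this expanded data list and the resulting fit object is called fit_fore (a hypothetical name used only for this sketch), the forecast portion of the long-run calendar component can be pulled out as:

fl_fore <- extract(fit_fore)                       ## fit_fore: stanfit from the expanded data list
P_obs <- max(data[,"t"]) - min(data[,"t"]) + 1     ## number of observed periods
alpha_long_fore <- fl_fore$alpha_long[,(P_obs+1):(P_obs+50)]   ## predictive draws for the 50 extra periods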