Boosting with tree and forest classifiers
wnowak10
XGBoost is a popular ML library. It’s got the term boost in its name, so clearly the act of “boosting” is central to the performance of this algorithm. What does boosting mean with respect to decision trees and forests? Let’s investigate this issue here.
Hastie et al. describe boosting as "…a procedure that combines the outputs of many 'weak' classifiers to produce a powerful 'committee.'" As they make clear in their seminal text, the boosting algorithm (due to Freund and Schapire (1997)) successively constructs weak classifiers (classifiers whose error rate is only slightly better than random guessing) on training data that differs at each successive step. In the end, we classify via a weighted vote over all of the weak classifiers. Here's a schematic from the ESL text.
The training data changes at each step of the boosting process. We initially train on the original data, but after each iteration we reweight it, giving more weight to the observations that were previously misclassified. In this way, we boost hard-to-classify observations.
ESL outlines the pseudo-algorithm here:
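Paraphrasing ESL's Algorithm 10.1 (AdaBoost.M1), the steps are:
1. Initialize the observation weights w_i = 1/N.
2. For m = 1 to M: (a) fit a classifier G_m(x) to the training data using the weights w_i; (b) compute the weighted error err_m = sum_i w_i * I(y_i != G_m(x_i)) / sum_i w_i; (c) compute alpha_m = log((1 - err_m) / err_m); (d) update the weights w_i <- w_i * exp(alpha_m * I(y_i != G_m(x_i))).
3. Output the final classifier G(x) = sign(sum_m alpha_m * G_m(x)).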
As in the post on forests, I will try to recreate the simulation Hastie et al run.
Generate data
First, we generate the training and test data.
# make_boost_data
# boosting simulation data from ESL p. 339
# create training data creation function; n = number of observations
create_train_data = function(n){
  # create features x1 through x10 (standard normal)
  for(i in 1:10){
    # use assign to create the variable names (x1, x2, ...)
    assign(paste("x", i, sep = ""), rnorm(n))
  }
  # create label var (y) as per ESL p. 339 (eq. 10.2)
  y_lim = qchisq(.5, 10)
  y = ifelse(x1^2 + x2^2 + x3^2 +
             x4^2 + x5^2 + x6^2 +
             x7^2 + x8^2 +
             x9^2 + x10^2 > y_lim, 1, -1)
  # combine into a data frame
  df = data.frame(x1, x2, x3, x4,
                  x5, x6, x7, x8,
                  x9, x10, y)
  # make y a factor for rpart class prediction
  df$y = as.factor(df$y)
  return(df)
}
# create test data creation function; same generation process as the training data
create_test_data = function(n_test){
  # create features
  for(i in 1:10){
    assign(paste("x", i, sep = ""), rnorm(n_test))
  }
  # create label var (y)
  y_lim = qchisq(.5, 10)
  y = ifelse(x1^2 + x2^2 + x3^2 +
             x4^2 + x5^2 + x6^2 +
             x7^2 + x8^2 +
             x9^2 + x10^2 > y_lim, 1, -1)
  df = data.frame(x1, x2, x3, x4,
                  x5, x6, x7, x8,
                  x9, x10, y)
  df$y = as.factor(df$y)
  # sanity check: sum(y == 1) should be about half the observations
  # (ESL reports roughly 1000 per class in the 2000 training cases) -- it is
  return(df)
}
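With these two functions, we can generate the data. ESL uses 2,000 training cases and 10,000 test observations; a minimal sketch (the seed is arbitrary):
set.seed(10)                        # arbitrary seed, just for reproducibility
train_df = create_train_data(2000)  # 2000 training cases, as in ESL
test_df  = create_test_data(10000)  # 10,000 test observations, as in ESL
table(train_df$y)                   # classes should be roughly balanced (~1000 each)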
Fit initial model
If we use a single-stump classifier, the results on the test set match the ESL text. We use the rpart function from the R rpart library to fit a tree: we fit y on all ten features (x1…x10) with the "class" method (i.e. classification rather than regression), using train_df, created by the create_train_data function, as our training observations. We use the rpart.control option to limit the depth of the tree to 1, so we are creating stumps.
library(rpart)
fit <- rpart::rpart(y ~ x1 + x2 + x3 + x4 + x5 +
                        x6 + x7 + x8 + x9 + x10,
                    method = "class", data = train_df,
                    control = rpart.control(maxdepth = 1))
If we use rpart.plot, we can see what this stump looks like.
library(rpart.plot)
rpart.plot(fit,type=0)
To understand what this image shows…
- If x3 >= 1.2, we predict y = -1. 47% of our training data was classified this way, and the accuracy when following that rule is 87% (i.e. 87% of the training observations satisfying that split have y = -1).
- Else, we predict y=1 (with a similar explanation applying).
(FYI - this is a pretty poorly fitting model. From the data-generating function, the true decision boundary is a sphere in the ten features, so no single axis-aligned split will model the data well; the split made here is fairly arbitrary.)
Error rates
We then create functions that return error rates when fed a model and data. We split the training-error computation into two functions, as it will be useful to have a standalone function that outputs predictions when we implement Algorithm 10.1.
# function to generate predicted y values, given a model (fit) and a data frame
predictions = function(fit, df){
  # predict with type = "vector" returns 1, 2 (the factor level codes for -1 and 1)
  p = predict(fit, newdata = df, type = "vector")
  # recode to match the -1 / 1 training labels (run these replacements in this order)
  p = replace(p, p == 1, -1)
  p = replace(p, p == 2, 1)
  return(p)
}
# given predictions and class labels, report the error rate
train_error_rate = function(train_p, train_df){
  train_success = sum(train_p == train_df$y) / length(train_p)
  return(1 - train_success) # train error rate
}
# redundant convenience function: does both of the above steps in one
test_error_rate = function(fit, test_df){
  # predict codes the classes as factor levels 1 and 2
  p = predict(fit, newdata = test_df, type = "vector")
  # recode to -1 / 1 (run these replacements in this order!)
  p = replace(p, p == 1, -1)
  p = replace(p, p == 2, 1)
  test_success = sum(p == test_df$y) / length(p)
  return(1 - test_success) # test error rate; matches ESL
}
train_p = predictions(fit, train_df)
train_error_rate(train_p, train_df)
test_error_rate(fit, test_df)
Now, we implement the algorithm.
# create empty vectors for err_m, alpha_m, and the predictions at each step
err = c()
alpha = c()
preds = c()

# function to return boosted predictions on the test set
# inputs: m = number of boosting iterations, training data, and test data
boost = function(m, train_df, test_df){
  # how big is the training data?
  n = nrow(train_df)
  # initialize weights: each of the n training observations gets weight 1/n
  w = rep(1/n, n)
  # load the predictions / error-rate helper functions
  source('predict_error.R')
  # repeat the boosting algorithm (10.1) m times
  for(i in seq(m)){
    # 2a) fit a stump to the training data, taking the current weights into account
    boost_fit <- rpart(y ~ x1 + x2 + x3 + x4 + x5 +
                           x6 + x7 + x8 + x9 + x10,
                       method = "class", data = train_df,
                       weights = w,
                       control = rpart.control(maxdepth = 1))
    # generate predicted values on the test set, given the most recent fit (G_m),
    # and store them in preds
    preds = c(preds, predictions(boost_fit, test_df))
    # predictions on the training set, used to keep updating the model's weights
    train_preds = predictions(boost_fit, train_df)
    # 2b) weighted error: matrix multiplication gives sum_i w_i * I(prediction_i != y_i),
    # which we divide by sum(w). In the first round this is just the raw error rate;
    # subsequent rounds give weighted error rates.
    err = c(err, (w %*% (train_preds != train_df$y)) / sum(w))
    # 2c) alpha_m = log((1 - err_m) / err_m)
    # if the error rate is high, this approaches -inf;
    # if the error rate is low (near 0), this approaches +inf
    alpha = c(alpha, log((1 - err[i]) / err[i]))
    # 2d) update the weights: misclassified observations get multiplied by e^alpha_m.
    # AdaBoost assumes each weak learner beats chance (err_m < 0.5), so alpha_m > 0
    # and misclassified observations are up-weighted.
    numeric_mismatch = as.numeric(train_preds != train_df$y)
    w = w * exp(alpha[i] * numeric_mismatch)
  }
  # how many rows are in the test set?
  num_test_obs = dim(test_df)[1]
  summ = rep(0, num_test_obs)
  chunks = list()
  for(i in 1:m){
    # chunk i holds the num_test_obs test-set predictions from iteration i
    chunks[[i]] = preds[(1 + ((i - 1) * num_test_obs)) : (num_test_obs + ((i - 1) * num_test_obs))]
    # running alpha-weighted sum of predictions; this is step 3 in alg. 10.1
    summ = summ + alpha[i] * chunks[[i]]
  }
  # take the sign to get the final predicted values
  vals = ifelse(summ > 0, 1, -1)
  return(vals)
}
Let’s work through this step by step. First we create empty vectors in which to store err_m and alpha_m. In the boost function, we initialize the weights so every observation has weight 1/n (with n the number of training observations). Then we enter the for loop, as per step 2 of the algorithm.
For 2a), we fit a stump:
boost_fit <- rpart(y ~ x1 + x2 + x3 + x4 + x5 +
                       x6 + x7 + x8 + x9 + x10,
                   method = "class", data = train_df,
                   weights = w,
                   control = rpart.control(maxdepth = 1)) # fit a stump with the current weights
For 2b), we compute err_m. To find the sum of w_i * I, we use matrix multiplication to compute the dot product. To generate the indicator (such that I = 1 when we have MISpredicted), we run the following:
train_preds != train_df$y
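As a toy illustration of that dot product (the weights and mismatches below are made up, not values from the simulation):
w_toy  = c(0.25, 0.25, 0.25, 0.25)         # toy weights
missed = c(TRUE, FALSE, TRUE, FALSE)       # toy indicator: prediction != label
as.numeric(w_toy %*% missed) / sum(w_toy)  # weighted error = 0.5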
For 2c) and 2d), we compute alpha_m and update the weights. If we predict an observation correctly, its weight stays the same (w = w * exp(0)). If we mispredict it, we multiply its weight by exp(alpha_m) = (1 - err_m)/err_m, which has the effect of making w larger the closer our error rate is to 0.
alpha = c(alpha, log((1 - err[i]) / err[i])) # alpha_m = log((1 - err_m) / err_m)
w = w * exp(alpha[i] * numeric_mismatch)     # set the new weights
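Continuing the toy numbers above (again, purely illustrative values):
err_toy   = 0.3                              # pretend weighted error rate
alpha_toy = log((1 - err_toy) / err_toy)     # about 0.85
w_toy * exp(alpha_toy * as.numeric(missed))  # missed obs grow by a factor of (1-err)/err, about 2.33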
The last bit of code adds up the predictions from each step. We take preds, which is one long vector of predictions of length nrow(test_df) * m, split it back up into m chunks of size nrow(test_df), weight each chunk by its alpha, and sum. We return the predicted values for test_df from the model trained on train_df.
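To run it end to end, something like the following works (100 iterations is an arbitrary choice here, not the 400 used in ESL):
boosted_preds = boost(100, train_df, test_df)        # boosted predictions on the test set
1 - sum(boosted_preds == test_df$y) / nrow(test_df)  # boosted test error rate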
Hastie et al. find the following with this data in their simulations.
When I recreate this (with fewer boosting iterations, since running the full 400 iterations takes a while), I find: