Implementing AdaBoost using decision tree stump weak classifiers
This notebook explores the well-known AdaBoost M1 algorithm, which combines several weak classifiers to create a better overall classifier. The notebook consists of three main sections:
A review of the AdaBoost M1 algorithm and an intuitive visualization of its inner workings
An implementation from scratch in Python, using a scikit-learn decision tree stump as the weak classifier
A discussion of the trade-off between the learning rate and the number of weak classifiers
This notebook was used as a basis for the following answers on stats.stackexchange
This implementation relies on a simple decision tree stump with maximum depth 1 and 2 leaf nodes. Of particular interest is the plot of the decision boundaries of the different weak learners inside the AdaBoost combination, together with their respective sample weights.
An intuitive explanation of the AdaBoost algorithm
AdaBoost recap
Let $G_m(x),\ m = 1,2,\ldots,M$ be the sequence of weak classifiers. Our objective is to build the following combined classifier:

$$G(x) = \operatorname{sign}\left( \sum_{m=1}^{M} \alpha_m G_m(x) \right)$$
The final prediction is a combination of the predictions from all classifiers through a weighted majority vote
The coefficients $\alpha_m$ are computed by the boosting algorithm as $\alpha_m = \log\left( \frac{1 - \text{err}_m}{\text{err}_m} \right)$, where $\text{err}_m$ is the weighted error rate of $G_m$. They weight the contribution of each respective $G_m(x)$, giving higher influence to the more accurate classifiers in the sequence.
At each boosting step, the data is modified by applying weights $w_1, w_2,\ldots,w_N$ to each training observation. At step $m$, the observations that were misclassified at the previous step have their weights increased.
Note that at the first step $m=1$ the weights are initialized uniformly $w_i = 1 / N$
The algorithm, as summarized on page 339 of Elements of Statistical Learning:
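The steps above translate almost line-for-line into code. Below is a minimal from-scratch sketch; the function name `AdaBoost_scratch` and the returned `estimator_list` / `estimator_weight_list` match the plotting utilities used later in this notebook, but treat it as an illustrative reconstruction rather than the exact original cell:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def AdaBoost_scratch(X, y, M=10, learning_rate=1.0):
    '''Fit an AdaBoost.M1 sequence of decision tree stumps; return the estimators,
    their coefficients alpha_m, and the sample weights at each step'''
    N = len(y)
    estimator_list, estimator_weight_list, sample_weight_list = [], [], []

    # Initialize the sample weights uniformly: w_i = 1/N
    sample_weight = np.ones(N) / N
    sample_weight_list.append(sample_weight.copy())

    for m in range(M):
        # (a) Fit a weak classifier G_m to the weighted training data
        estimator = DecisionTreeClassifier(max_depth=1, max_leaf_nodes=2)
        estimator.fit(X, y, sample_weight=sample_weight)
        y_predict = estimator.predict(X)

        # (b) Weighted error rate err_m
        incorrect = (y_predict != y)
        estimator_error = np.average(incorrect, weights=sample_weight)

        # (c) Coefficient alpha_m = log((1 - err_m) / err_m), scaled by the learning rate
        estimator_weight = learning_rate * np.log((1. - estimator_error) / estimator_error)

        # (d) Increase the weights of the misclassified observations and renormalize
        sample_weight *= np.exp(estimator_weight * incorrect)
        sample_weight /= sample_weight.sum()

        estimator_list.append(estimator)
        estimator_weight_list.append(estimator_weight)
        sample_weight_list.append(sample_weight.copy())

    return np.array(estimator_list), np.array(estimator_weight_list), np.array(sample_weight_list)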
Consider the toy dataset below, on which I have applied AdaBoost with the following settings: number of iterations $M = 10$, weak classifier = decision tree of depth 1 with 2 leaf nodes. The boundary between red and blue data points is clearly non-linear, yet the algorithm does pretty well.
Visualizing the sequence of weak learners and the sample weights
The first 6 weak learners $m = 1,2,\ldots,6$ are shown below. The scatter points are scaled according to their respective sample weights at each iteration.
First iteration:
The decision boundary is very simple (linear) since these are weak learners
All points are of the same size, as expected
6 blue points are in the red region and are misclassified
Second iteration:
The linear decision boundary has changed
The previously misclassified blue points are now larger (greater sample_weight) and have influenced the decision boundary
9 blue points are now misclassified
Final result after 10 iterations
All classifiers have a linear decision boundary, at different positions. The resulting coefficients $\alpha_m$ of the first 6 iterations are:
[1.041, 0.875, 0.837, 0.781, 1.04, 0.938, …]
As expected, the first iteration has the largest coefficient, as it is the one with the fewest misclassifications.
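A quick sanity check of the first coefficient, assuming uniform initial weights over the 23 toy data points and the 6 misclassifications noted above:

$$\text{err}_1 = \frac{6}{23} \approx 0.261, \qquad \alpha_1 = \log\left( \frac{1 - 0.261}{0.261} \right) \approx 1.04$$

which matches the first printed coefficient.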
Next steps
An intuitive explanation of gradient boosting - to be completed
# Toy dataset
import numpy as np

x1 = np.array([.1, .2, .4, .8, .8, .05, .08, .12, .33, .55, .66, .77, .88, .2, .3, .4, .5, .6, .25, .3, .5, .7, .6])
x2 = np.array([.2, .65, .7, .6, .3, .1, .4, .66, .77, .65, .68, .55, .44, .1, .3, .4, .3, .15, .15, .5, .55, .2, .4])
y = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])
X = np.vstack((x1, x2)).T
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def plot_decision_boundary(classifier, X, y, N=10, scatter_weights=np.ones(len(y)), ax=None):
    '''Utility function to plot decision boundary and scatter plot of data'''
    x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, N), np.linspace(y_min, y_max, N))

    # Check which prediction methods are available
    if hasattr(classifier, "decision_function"):
        zz = np.array([classifier.decision_function(np.array([xi, yi]).reshape(1, -1))
                       for xi, yi in zip(np.ravel(xx), np.ravel(yy))])
    elif hasattr(classifier, "predict_proba"):
        zz = np.array([classifier.predict_proba(np.array([xi, yi]).reshape(1, -1))[:, 1]
                       for xi, yi in zip(np.ravel(xx), np.ravel(yy))])
    else:
        zz = np.array([classifier(np.array([xi, yi]).reshape(1, -1))
                       for xi, yi in zip(np.ravel(xx), np.ravel(yy))])

    # Reshape result and plot
    Z = zz.reshape(xx.shape)
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])

    # Get current axis and plot
    if ax is None:
        ax = plt.gca()
    ax.contourf(xx, yy, Z, 2, cmap='RdBu', alpha=.5)
    ax.contour(xx, yy, Z, 2, cmap='RdBu')
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cm_bright, s=scatter_weights * 40)
    ax.set_xlabel('$X_1$')
    ax.set_ylabel('$X_2$')
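A possible usage sketch reproducing the per-iteration plots described above (this assumes the `AdaBoost_scratch` sketch from earlier; the figure layout is illustrative):

# Plot the first 6 weak learners with scatter size proportional to sample weight
estimator_list, estimator_weight_list, sample_weight_list = AdaBoost_scratch(X, y, M=10, learning_rate=1.0)

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for m, ax in enumerate(axes.ravel()):
    # Scale the weights so that the uniform initialization (1/N) maps to the default point size
    plot_decision_boundary(estimator_list[m], X, y, N=50,
                           scatter_weights=sample_weight_list[m] * len(y), ax=ax)
    ax.set_title(f'Weak learner m = {m}')
plt.show()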
def plot_AdaBoost_scratch_boundary(estimators, estimator_weights, X, y, N=10, ax=None):
    '''Utility function to plot the decision boundary of a fitted AdaBoost and a scatter plot of the data'''

    def AdaBoost_scratch_classify(x_temp, est, est_weights):
        '''Return classification prediction for a given point x_temp and a previously fitted AdaBoost'''
        temp_pred = np.asarray([(e.predict(x_temp)).T * w for e, w in zip(est, est_weights)]) / est_weights.sum()
        return np.sign(temp_pred.sum(axis=0))

    x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, N), np.linspace(y_min, y_max, N))

    zz = np.array([AdaBoost_scratch_classify(np.array([xi, yi]).reshape(1, -1), estimators, estimator_weights)
                   for xi, yi in zip(np.ravel(xx), np.ravel(yy))])

    # Reshape result and plot
    Z = zz.reshape(xx.shape)
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])

    if ax is None:
        ax = plt.gca()
    ax.contourf(xx, yy, Z, 2, cmap='RdBu', alpha=.5)
    ax.contour(xx, yy, Z, 2, cmap='RdBu')
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cm_bright)
    ax.set_xlabel('$X_1$')
    ax.set_ylabel('$X_2$')
Plotting the resulting decision boundary
plot_AdaBoost_scratch_boundary(estimator_list, estimator_weight_list, X, y, N = 50 )
Viewing decision boundaries of the weak classifiers inside AdaBoost
$M = 10$ iterations
The size of each scatter point is proportional to its sample weight
For example, in the top-left plot $m = 0$, 6 blue points are misclassified. In the next iteration $m = 1$, their sample weights are increased and the scatter plot shows them as bigger.
The impact of the learning rate $L$ and the number of weak classifiers $M$
From the above algorithm we can understand intuitively that
Decreasing the learning rate $L$ makes the coefficients $\alpha_m$ smaller, which reduces the amplitude of the sample weights at each step (since $w_i \leftarrow w_i \cdot e^{\alpha_m \mathcal{I}(y_i \neq G_m(x_i))}$). This translates into smaller variations of the weighted data points and therefore fewer differences between the weak classifier decision boundaries
Increasing the number of weak classifiers $M$ increases the number of iterations and allows the sample weights to reach greater amplitude. This translates into 1) more weak classifiers to combine at the end, and 2) more variation in the decision boundaries of these classifiers. Put together, these effects tend to lead to more complex overall decision boundaries.
From this intuition, it would make sense to see a trade-off between the parameters $L$ and $M$: increasing one and decreasing the other will tend to cancel the effect.
An example on a toy dataset
Here is a toy dataset used in a different question. Plotting the final decision boundary for different values of $L$ and $M$ shows there is an intuitive relation between the two (a code sketch for producing such a grid follows the list below).
Too small $L$ or $M$ leads to an overly simplistic boundary (left-hand side plots)
Making one large and the other small tends to cancel the effect (middle plots)
Making both large gives a good result, but may overfit (right-hand side plot)
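As a minimal sketch of how such a grid of plots can be produced with the helpers defined earlier (the specific $(L, M)$ pairs are illustrative, not the exact values used above):

# Final decision boundaries for a few (learning rate, number of classifiers) pairs
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (L, M) in zip(axes.ravel(), [(0.1, 10), (0.1, 100), (1.0, 100)]):
    estimator_list, estimator_weight_list, sample_weight_list = AdaBoost_scratch(X, y, M=M, learning_rate=L)
    plot_AdaBoost_scratch_boundary(estimator_list, estimator_weight_list, X, y, N=50, ax=ax)
    ax.set_title(f'L = {L}, M = {M}')
plt.show()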
SentencePiece was introduced in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing". Its main goal is to remove the need for language-specific tokenization rules, as illustrated by the following case:
Raw text:     Hello world.
Tokenized:    [Hello] [world] [.]
Decoded text: Hello world .
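Because the whitespace information is lost during tokenization, detokenizing `[Hello] [world] [.]` yields `Hello world .` rather than the original string. SentencePiece avoids this by treating the input as a raw character stream and encoding whitespace explicitly (as `▁`), which makes decoding lossless. A minimal sketch with the `sentencepiece` package (the corpus file name and vocabulary size are placeholder assumptions):

import sentencepiece as spm

# Train a small model on a raw-text corpus (file name and vocab size are placeholders)
spm.SentencePieceTrainer.train(input='corpus.txt', model_prefix='m', vocab_size=8000)

sp = spm.SentencePieceProcessor(model_file='m.model')
pieces = sp.encode('Hello world.', out_type=str)
print(pieces)              # e.g. ['▁Hello', '▁world', '.']
print(sp.decode(pieces))   # 'Hello world.' -- whitespace is recovered exactly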
# Tokenizers provides ultra-fast implementations of most current tokenizers:
>>> from tokenizers import (ByteLevelBPETokenizer,
                            CharBPETokenizer,
                            SentencePieceBPETokenizer,
                            BertWordPieceTokenizer)
# Ultra-fast => they can encode 1GB of text in ~20sec on a standard server's CPU

# Tokenizers can be easily instantiated from standard files
>>> tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
# Tokenizers provide exhaustive outputs: tokens, mapping to original string, attention/special token masks.
# They also handle model's max input lengths as well as padding (to directly encode in padded batches)
>>> output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
>>> print(output.ids, output.tokens, output.offsets)
[101, 7592, 1010, 1061, 1005, 2035, 999, 2129, 2024, 2017, 100, 1029, 102]
['[CLS]', 'hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', '[UNK]', '?', '[SEP]']
[(0, 0), (0, 5), (5, 6), (7, 8), (8, 9), (9, 12), (12, 13), (14, 17), (18, 21), (22, 25), (26, 27), (28, 29), (0, 0)]

# Here is an example using the offsets mapping to retrieve the string corresponding to the 10th token:
>>> output.original_str[output.offsets[10]]
'😁'
Training your own tokenizer
# You can also train a BPE/Byte-level BPE/WordPiece vocabulary on your own files
>>> tokenizer = ByteLevelBPETokenizer()
>>> tokenizer.train(["wiki.test.raw"], vocab_size=20000)
[00:00:00] Tokenize words   ████████████████████████████████████████   20993/20993
[00:00:00] Count pairs      ████████████████████████████████████████   20993/20993
[00:00:03] Compute merges   ████████████████████████████████████████   19375/19375
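Once trained, the tokenizer can be used right away; a short sketch (the sample sentence is arbitrary, and `save_model` is available in recent versions of tokenizers):

>>> output = tokenizer.encode("Hello world")
>>> print(output.tokens)
# Persist the learned vocabulary and merges files for later reuse
>>> tokenizer.save_model(".")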
References
This article expands on a blog post from Floydhub. It also draws mainly on the original Unigram paper and the official BPE explanation, among other sources. The animated BPE illustration comes from a related blog post on Towards Data Science. In addition, the last chapter is based on the official tokenizers repository.
import os
import torch
import pandas as pd
import torch.nn as nn
import numpy as np
import torch.nn.functional as F
from torch.optim import lr_scheduler

# tqdm is used to show a progress bar during notebook training
from tqdm.autonotebook import tqdm
from sklearn import model_selection
from sklearn import metrics
import transformers
import tokenizers
from transformers import AdamW
from transformers import get_linear_schedule_with_warmup
# Find the start and end character indices of selected_text inside tweet
# (selected_text is matched with a leading space, hence the comparison below)
for ind in (i for i, e in enumerate(tweet) if e == selected_text[1]):
    if " " + tweet[ind: ind + len_st] == selected_text:
        idx0 = ind
        idx1 = ind + len_st - 1
        break
# Mark with 1 every character of the tweet that belongs to the selected span
char_targets = [0] * len(tweet)
if idx0 != None and idx1 != None:
    for ct in range(idx0, idx1 + 1):
        char_targets[ct] = 1

# Tokenize the tweet and keep the character offsets of each token
tok_tweet = tokenizer.encode(tweet)
input_ids_orig = tok_tweet.ids
tweet_offsets = tok_tweet.offsets

# A token is part of the target span if any of its characters is marked
target_idx = []
for j, (offset1, offset2) in enumerate(tweet_offsets):
    if sum(char_targets[offset1: offset2]) > 0:
        target_idx.append(j)

targets_start = target_idx[0]
targets_end = target_idx[-1]
def loss_fn(start_logits, end_logits, start_positions, end_positions):
    """
    Return the sum of the cross entropy losses for both the start and end logits
    """
    loss_fct = nn.CrossEntropyLoss()
    start_loss = loss_fct(start_logits, start_positions)
    end_loss = loss_fct(end_logits, end_positions)
    total_loss = start_loss + end_loss
    return total_loss
def train_fn(data_loader, model, optimizer, device, scheduler=None):
    """
    Trains the bert model on the twitter data
    """
    # Set model to training mode (dropout + sampled batchnorm is activated)
    model.train()
    losses = utils.AverageMeter()
    jaccards = utils.AverageMeter()
    # Set tqdm to add loading screen and set the length
    tk0 = tqdm(data_loader, total=len(data_loader))

    # Train the model on each batch
    for bi, d in enumerate(tk0):
        # Unpack the batch dictionary (same fields as in eval_fn below)
        ids = d["ids"]
        token_type_ids = d["token_type_ids"]
        mask = d["mask"]
        targets_start = d["targets_start"]
        targets_end = d["targets_end"]
        sentiment = d["sentiment"]
        orig_selected = d["orig_selected"]
        orig_tweet = d["orig_tweet"]
        offsets = d["offsets"].numpy()
        # Move ids, masks, and targets to gpu while setting as torch.long
        ids = ids.to(device, dtype=torch.long)
        token_type_ids = token_type_ids.to(device, dtype=torch.long)
        mask = mask.to(device, dtype=torch.long)
        targets_start = targets_start.to(device, dtype=torch.long)
        targets_end = targets_end.to(device, dtype=torch.long)
        # Reset gradients
        model.zero_grad()

        # Use ids, masks, and token types as input to the model
        # Predict logits for each of the input tokens for each batch
        outputs_start, outputs_end = model(
            ids=ids,
            mask=mask,
            token_type_ids=token_type_ids,
        )  # (bs x SL), (bs x SL)

        # Calculate batch loss based on CrossEntropy
        loss = loss_fn(outputs_start, outputs_end, targets_start, targets_end)
        # Calculate gradients based on loss
        loss.backward()
        # Adjust weights based on calculated gradients
        optimizer.step()
        # Update scheduler
        scheduler.step()

        # Apply softmax to the start and end logits
        # This squeezes each of the logits in a sequence to a value between 0 and 1, while ensuring that they sum to 1
        # This is similar to the characteristics of "probabilities"
        outputs_start = torch.softmax(outputs_start, dim=1).cpu().detach().numpy()
        outputs_end = torch.softmax(outputs_end, dim=1).cpu().detach().numpy()

        # Calculate the jaccard score based on the predictions for this batch
        jaccard_scores = []
        for px, tweet in enumerate(orig_tweet):
            selected_tweet = orig_selected[px]
            tweet_sentiment = sentiment[px]
            jaccard_score, _ = calculate_jaccard_score(
                original_tweet=tweet,                       # Full text of the px'th tweet in the batch
                target_string=selected_tweet,               # Span containing the specified sentiment for the px'th tweet
                sentiment_val=tweet_sentiment,              # Sentiment of the px'th tweet in the batch
                idx_start=np.argmax(outputs_start[px, :]),  # Predicted start index for the px'th tweet
                idx_end=np.argmax(outputs_end[px, :]),      # Predicted end index for the px'th tweet
                offsets=offsets[px]                         # Offsets for each of the tokens of the px'th tweet
            )
            jaccard_scores.append(jaccard_score)

        # Update the jaccard score and loss
        # For details, refer to `AverageMeter` in https://www.kaggle.com/abhishek/utils
        jaccards.update(np.mean(jaccard_scores), ids.size(0))
        losses.update(loss.item(), ids.size(0))
        # Print the average loss and jaccard score at the end of each batch
        tk0.set_postfix(loss=losses.avg, jaccard=jaccards.avg)
def calculate_jaccard_score(
        original_tweet,
        target_string,
        sentiment_val,
        idx_start,
        idx_end,
        offsets,
        verbose=False):
    """
    Calculate the jaccard score from the predicted span and the actual span for a batch of tweets
    """
    # A span's end index has to be greater than or equal to its start index
    # If this doesn't hold, the end index is set to equal the start index (the span is a single token)
    if idx_end < idx_start:
        idx_end = idx_start

    # Combine into a string the tokens that belong to the predicted span
    filtered_output = ""
    for ix in range(idx_start, idx_end + 1):
        filtered_output += original_tweet[offsets[ix][0]: offsets[ix][1]]
        # If the token is not the last token in the tweet, and the ending offset of the current token is less
        # than the beginning offset of the following token, add a space.
        # Basically, add a space when the next token (word piece) corresponds to a new word
        if (ix + 1) < len(offsets) and offsets[ix][1] < offsets[ix + 1][0]:
            filtered_output += " "
    # Set the predicted output as the original tweet when the tweet's sentiment is "neutral", or the tweet only contains one word
    if sentiment_val == "neutral" or len(original_tweet.split()) < 2:
        filtered_output = original_tweet
    # Calculate the jaccard score between the predicted span and the actual span
    # The IOU (intersection over union) approach is detailed in the utils module's `jaccard` function:
    # https://www.kaggle.com/abhishek/utils
    jac = utils.jaccard(target_string.strip(), filtered_output.strip())
    return jac, filtered_output
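For reference, the word-level Jaccard similarity used above can be sketched as follows; this mirrors the competition's metric definition, while the actual `utils.jaccard` lives in the linked Kaggle utility script:

def jaccard(str1, str2):
    """Word-level intersection-over-union between two strings"""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))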
def eval_fn(data_loader, model, device):
    """
    Evaluation function to predict on the test set
    """
    # Set model to evaluation mode
    # I.e., turn off dropout and set batchnorm to use overall mean and variance (from training), rather than batch level mean and variance
    # Reference: https://github.com/pytorch/pytorch/issues/5406
    model.eval()
    losses = utils.AverageMeter()
    jaccards = utils.AverageMeter()

    # Turns off gradient calculations (https://datascience.stackexchange.com/questions/32651/what-is-the-use-of-torch-no-grad-in-pytorch)
    with torch.no_grad():
        tk0 = tqdm(data_loader, total=len(data_loader))
        # Make predictions and calculate loss / jaccard score for each batch
        for bi, d in enumerate(tk0):
            ids = d["ids"]
            token_type_ids = d["token_type_ids"]
            mask = d["mask"]
            sentiment = d["sentiment"]
            orig_selected = d["orig_selected"]
            orig_tweet = d["orig_tweet"]
            targets_start = d["targets_start"]
            targets_end = d["targets_end"]
            offsets = d["offsets"].numpy()

            # Move tensors to the gpu while setting as torch.long (mirrors train_fn above)
            ids = ids.to(device, dtype=torch.long)
            token_type_ids = token_type_ids.to(device, dtype=torch.long)
            mask = mask.to(device, dtype=torch.long)
            targets_start = targets_start.to(device, dtype=torch.long)
            targets_end = targets_end.to(device, dtype=torch.long)
            # Predict logits for start and end indexes
            outputs_start, outputs_end = model(
                ids=ids,
                mask=mask,
                token_type_ids=token_type_ids
            )
            # Calculate loss for the batch
            loss = loss_fn(outputs_start, outputs_end, targets_start, targets_end)

            # Apply softmax to the predicted logits for the start and end indexes
            # This converts the "logits" to "probability-like" scores
            outputs_start = torch.softmax(outputs_start, dim=1).cpu().detach().numpy()
            outputs_end = torch.softmax(outputs_end, dim=1).cpu().detach().numpy()

            # Calculate jaccard scores for each tweet in the batch
            jaccard_scores = []
            for px, tweet in enumerate(orig_tweet):
                selected_tweet = orig_selected[px]
                tweet_sentiment = sentiment[px]
                jaccard_score, _ = calculate_jaccard_score(
                    original_tweet=tweet,
                    target_string=selected_tweet,
                    sentiment_val=tweet_sentiment,
                    idx_start=np.argmax(outputs_start[px, :]),
                    idx_end=np.argmax(outputs_end[px, :]),
                    offsets=offsets[px]
                )
                jaccard_scores.append(jaccard_score)
            # Update running jaccard score and loss
            jaccards.update(np.mean(jaccard_scores), ids.size(0))
            losses.update(loss.item(), ids.size(0))
            # Print the running average loss and jaccard score
            tk0.set_postfix(loss=losses.avg, jaccard=jaccards.avg)

    print(f"Jaccard = {jaccards.avg}")
    return jaccards.avg
def run(fold):
    """
    Train model for a specified fold
    """
    # Read training csv
    dfx = pd.read_csv(config.TRAINING_FILE)
    # Set train / validation split based on the fold
    df_train = dfx[dfx.kfold != fold].reset_index(drop=True)
    df_valid = dfx[dfx.kfold == fold].reset_index(drop=True)

    # Instantiate TweetDataset with training data
    train_dataset = TweetDataset(
        tweet=df_train.text.values,
        sentiment=df_train.sentiment.values,
        selected_text=df_train.selected_text.values
    )
    # Instantiate DataLoader with `train_dataset`
    # This is a generator that yields the dataset in batches
    train_data_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=config.TRAIN_BATCH_SIZE,
        num_workers=4
    )
    # Instantiate TweetDataset with validation data
    valid_dataset = TweetDataset(
        tweet=df_valid.text.values,
        sentiment=df_valid.sentiment.values,
        selected_text=df_valid.selected_text.values
    )

    # Instantiate DataLoader with `valid_dataset`, mirroring the training loader above
    # (needed for the eval_fn call below; config.VALID_BATCH_SIZE is assumed to be defined in config)
    valid_data_loader = torch.utils.data.DataLoader(
        valid_dataset,
        batch_size=config.VALID_BATCH_SIZE,
        num_workers=2
    )
    # Set device as `cuda` (GPU)
    device = torch.device("cuda")
    # Load the pretrained RoBERTa config
    model_config = transformers.RobertaConfig.from_pretrained(config.ROBERTA_PATH)
    # Output hidden states
    # This is important to set since we want to concatenate the hidden states from the last 2 layers
    model_config.output_hidden_states = True
    # Instantiate our model with `model_config`
    model = TweetModel(conf=model_config)
    # Move the model to the GPU
    model.to(device)
    # Calculate the number of training steps
    num_train_steps = int(len(df_train) / config.TRAIN_BATCH_SIZE * config.EPOCHS)
    # Get the list of named parameters
    param_optimizer = list(model.named_parameters())
    # Specify parameters where weight decay shouldn't be applied
    no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
    # Define two sets of parameters: those with weight decay, and those without
    optimizer_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay': 0.001},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay': 0.0},
    ]
    # Instantiate AdamW optimizer with our two sets of parameters, and a learning rate of 3e-5
    optimizer = AdamW(optimizer_parameters, lr=3e-5)
    # Create a scheduler to set the learning rate at each training step
    # "Create a schedule with a learning rate that decreases linearly after linearly increasing during a warmup period." (https://pytorch.org/docs/stable/optim.html)
    # Since num_warmup_steps = 0, the learning rate starts at 3e-5, and then linearly decreases at each training step
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=num_train_steps
    )
    # Apply early stopping with patience of 2
    # This means training stops when 2 epochs have passed without any improvement
    es = utils.EarlyStopping(patience=2, mode="max")
    print(f"Training is Starting for fold={fold}")

    # I'm training only for 3 epochs even though I specified 5!!!
    for epoch in range(5):
        train_fn(train_data_loader, model, optimizer, device, scheduler=scheduler)
        jaccard = eval_fn(valid_data_loader, model, device)
        print(f"Jaccard Score = {jaccard}")
        es(jaccard, model, model_path=f"model_{fold}.bin")
        if es.early_stop:
            print("Early stopping")
            break
# Load each of the five trained models and move to GPU
# (only model1 is shown here; models 2-5 follow the same pattern with model_1.bin ... model_4.bin)
model1 = TweetModel(conf=model_config)
model1.to(device)
model1.load_state_dict(torch.load("model_0.bin"))
model1.eval()
# Instantiate TweetDataset with the test data
test_dataset = TweetDataset(
    tweet=df_test.text.values,
    sentiment=df_test.sentiment.values,
    selected_text=df_test.selected_text.values
)
# Predict start and end logits for each of the five models
outputs_start1, outputs_end1 = model1(ids=ids, mask=mask, token_type_ids=token_type_ids)
outputs_start2, outputs_end2 = model2(ids=ids, mask=mask, token_type_ids=token_type_ids)
outputs_start3, outputs_end3 = model3(ids=ids, mask=mask, token_type_ids=token_type_ids)
outputs_start4, outputs_end4 = model4(ids=ids, mask=mask, token_type_ids=token_type_ids)
outputs_start5, outputs_end5 = model5(ids=ids, mask=mask, token_type_ids=token_type_ids)

# Get the average start and end logits across the five models and use these as predictions
# This is a form of "ensembling"
outputs_start = (
    outputs_start1
    + outputs_start2
    + outputs_start3
    + outputs_start4
    + outputs_start5
) / 5
outputs_end = (
    outputs_end1
    + outputs_end2
    + outputs_end3
    + outputs_end4
    + outputs_end5
) / 5

# Apply softmax to the predicted start and end logits
outputs_start = torch.softmax(outputs_start, dim=1).cpu().detach().numpy()
outputs_end = torch.softmax(outputs_end, dim=1).cpu().detach().numpy()
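The five repeated calls above can also be written more compactly; a small equivalent sketch, assuming the models are collected in a list:

# Equivalent loop form of the five-model logit averaging above
models = [model1, model2, model3, model4, model5]
starts, ends = zip(*[m(ids=ids, mask=mask, token_type_ids=token_type_ids) for m in models])
outputs_start = sum(starts) / len(models)
outputs_end = sum(ends) / len(models)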
# Convert the start and end scores to actual predicted spans (in string form)
for px, tweet in enumerate(orig_tweet):
    selected_tweet = orig_selected[px]
    tweet_sentiment = sentiment[px]
    _, output_sentence = calculate_jaccard_score(
        original_tweet=tweet,
        target_string=selected_tweet,
        sentiment_val=tweet_sentiment,
        idx_start=np.argmax(outputs_start[px, :]),
        idx_end=np.argmax(outputs_end[px, :]),
        offsets=offsets[px]
    )
    final_output.append(output_sentence)
# Post-process trick:
# Note: This trick comes from: https://www.kaggle.com/c/tweet-sentiment-extraction/discussion/140942
# When the LB resets, this trick won't help
def post_process(selected):
    return " ".join(set(selected.lower().split()))
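A quick illustrative call; note that `set()` removes duplicate words but does not preserve their order, so the exact output ordering may vary:

post_process("so so happy")   # e.g. "so happy" -- duplicates collapsed, order not guaranteed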
class AverageMeter:
"""
Computes and stores the average and current value
"""
def __init__(self):
self.reset()
def reset(self):
self.val = 0
self.avg = 0
self.sum = 0
self.count = 0
def update(self, val, n=1):
self.val = val
self.sum += val * n
self.count += n
self.avg = self.sum / self.count
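As a short usage sketch of `AverageMeter` (the numbers are arbitrary): each `update` records a batch mean `val` weighted by the batch size `n`, and `avg` tracks the overall running mean.

meter = AverageMeter()
meter.update(0.8, n=32)   # first batch of 32 samples with mean loss 0.8
meter.update(0.6, n=32)   # second batch of 32 samples with mean loss 0.6
print(meter.avg)          # 0.7 -- running average over all 64 samples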