How to apply XGBOOST algorithm using R

November 12, 2021

Extreme Gradient Boosting (xgboost) is a powerful machine learning algorithm and is one of the popular winning recipes. Over the past few years, predictive modeling has been termed as fast and accurate. Most of the tedious work can be done using this algorithm. Compared to the random forest and neural networks, XGBOOST has better efficiency, accuracy, and feasibility. The latest implementation on xgboost was launched in August 2015. Also, it is 10 times faster than existing gradient boosting implementations.

How to build models using Xgboost on R:

1. Load all the libraries

				
					library(xgboost) 
library(readr) 
library(stringr) 
library(caret) 
library(car)

2. Then load the dataset:
For instance, I have taken bank data to check whether the customer is eligible for a loan or not.

				
					set.seed(100) 
setwd("C:\\Users\\ts93856\\Desktop\\datasource") 
# load data 
df_train = read_csv("train_users_2.csv") 
df_test = read_csv("test_users.csv") 

# Loading labels of train data 

labels = df_train['labels'] 
df_train = df_train[-grep('labels', colnames(df_train))] 
# combine train and test data 
df_all = rbind(df_train,df_test)

3. Cleaning the data & featuring it:

				
					# clean Variables :  here I clean people with age less than 14 or more than 100 
df_all[df_all$age < 14 | df_all$age > 100,'age'] <- -1 
df_all$age[df_all$age < 0] <- mean(df_all$age[df_all$age > 0]) 

# one-hot-encoding categorical features 

ohe_feats = c('gender', 'education', 'employer') 
dummies <- dummyVars(~ gender +  education + employer, data = df_all) 
df_all_ohe <- as.data.frame(predict(dummies, newdata = df_all)) 
df_all_combined <- cbind(df_all[,-c(which(colnames(df_all) %in% ohe_feats))],df_all_ohe)df_all_combined$agena <- as.fa

4. Now we have to test and run the model:

				
					xgb <- xgboost(data = data.matrix(X[,-1]),  
label = y,  
eta = 0.1, 
max_depth = 15,  
nround=25,  
subsample = 0.5, 
colsample_bytree = 0.5, 
seed = 1, 
eval_metric = "merror", 
objective = "multi:softprob", 
num_class = 12, 
nthread = 3 
)

5. The last step is to score the test population:
Here you have an object “xgb,” an xgboost model. This is the way to score the test population.

				
					# predict values in test set 
y_pred <- predict(xgb, data.matrix(X_test[,-1]))

Parameters can be used in xgboost:

There are 3 types of parameters:

1. General Parameters – Using to do boosting and the model used are tree or linear model.
2. Booster parameters – Depends on which booster has been chosen.
3. Learning Task parameters – Helps to decide on learning scenarios, like regression tasks.

Advanced functionality of xgboost:

Implementing xgboost is really simple compared to other machine learning techniques. We already have a model, as shown above.

Now let’s find the variable importance in the model and subset our variable list:

				
					# Lets start with finding what the actual tree looks like 
model <- xgb.dump(xgb, with.stats = T) 
model[1:10] #This statement prints top 10 nodes of the model 

# Get the feature real names 
names <- dimnames(data.matrix(X[,-1]))[[2]] 

# Compute feature importance matrix 
importance_matrix <- xgb.importance(names, model = xgb) 

# Nice graph 
xgb.plot.importance(importance_matrix[1:10,]) 

#In case last step does not work for you because of a version issue, you can try following : 
barplot(importance_matrix[,1])

As you can see, there are many variables not worth using in the model, and we are free to remove those variables and run the model again so that we can expect a better accuracy.

Testing whether the result makes sense:

Let’s say age is the variable and the most important one, and there is a simple chi-square test to check whether the variable is important or not.

				
					test <- chisq.test(train$Age, output_vector) 
print(test)

The same process can be done for all the important variables. And we can identify whether the model has identified the important variables or not.

Success Stories

Testimonials

AI TRANSFORMATION

DATA + AI

CLOUD + INFRASTRUCTURE

CYBERSECURITY

PRODUCT EXPERIENCE

MODERN BUSINESS APPS

Enterprise Pharma AI

Financial Services Challenges

Retail Services

Public Sector Services

Pharma Marketing Manufacturing

MANUFACTURING

Hosting Services

ENERGY + UTILITIES

Veeva Vault Support Services

FINANCIAL SERVICES + INSURANCE

Table of Contents

How to apply XGBOOST algorithm using R

How to build models using Xgboost on R:

Parameters can be used in xgboost:

Advanced functionality of xgboost:

Testing whether the result makes sense:

Liked what you read !

Please leave a Feedback

Leave a Reply Cancel reply

Join the sustainability movement

Calculate Your DataOps ROI with Ease!

Related articles you may would like to read

How Cloud & AI Are Transforming Pharma Marketing Agencies

AI Knowledge Search Agents: Unlocking Enterprise Intelligence in Banking

Building a Secure, Vendor-Agnostic AI Foundation for Modern Banking

End to End Solution

Services

AI Powered Services

Professional Services

Partners

Resources