How to Apply the XGBoost Algorithm Using R

Extreme Gradient Boosting (XGBoost) is a powerful machine learning algorithm and one of the most popular winning recipes in predictive modeling, with a reputation for being both fast and accurate. Much of the tedious model-building work can be handled by this one algorithm. Compared to random forests and neural networks, XGBoost often delivers better efficiency, accuracy, and feasibility. An updated implementation of xgboost was launched in August 2015, and it is reported to be up to 10 times faster than existing gradient boosting implementations.

How to build models using xgboost in R:

1. Load all the required libraries:

# load the packages used throughout this tutorial
library(xgboost)  # gradient boosting
library(readr)    # read_csv() for fast CSV import
library(stringr)  # string manipulation helpers
library(caret)    # dummyVars() for one-hot encoding
library(car)      # general regression utilities
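
If any of these packages are missing, they can be installed from CRAN first; a one-line sketch:

# one-time setup: install any missing packages from CRAN
install.packages(c("xgboost", "readr", "stringr", "caret", "car"))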

2. Then load the dataset:
For this example, I have taken bank data to check whether a customer is eligible for a loan or not.

set.seed(100)
setwd("C:\\Users\\ts93856\\Desktop\\datasource")

# load data
df_train = read_csv("train_users_2.csv")
df_test = read_csv("test_users.csv")

# load the labels of the train data, then drop them from the predictors
labels = df_train['labels']
df_train = df_train[-grep('labels', colnames(df_train))]

# combine train and test data so both receive identical feature encoding
df_all = rbind(df_train, df_test)
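
Before going further, it helps to confirm the data loaded as expected; a quick optional check, assuming the file and column names above:

# optional sanity checks on the loaded data
dim(df_train)          # rows and columns of the training set
dim(df_test)           # rows and columns of the test set
table(labels$labels)   # class distribution of the target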

3. Clean the data and engineer features:

# clean variables: here I treat ages below 14 or above 100 as invalid
# ("age_invalid" is a helper added here so the flag survives the imputation
#  below; it is not part of the original snippet)
age_invalid <- df_all$age < 14 | df_all$age > 100
df_all[age_invalid, 'age'] <- -1
df_all$age[df_all$age < 0] <- mean(df_all$age[df_all$age > 0])

# one-hot encode the categorical features
ohe_feats = c('gender', 'education', 'employer')
dummies <- dummyVars(~ gender + education + employer, data = df_all)
df_all_ohe <- as.data.frame(predict(dummies, newdata = df_all))
df_all_combined <- cbind(df_all[, -c(which(colnames(df_all) %in% ohe_feats))], df_all_ohe)
# the original line was truncated at "as.fa"; a plausible completion flags
# the imputed ages as a factor feature
df_all_combined$agena <- as.factor(age_invalid)
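
One gap in the snippet above: the matrix X, the labels y, and the test matrix X_test used in the next step are never constructed. A minimal sketch of that missing step, assuming the combined frame is split back by row counts and the labels are recoded to the 0-based integers xgboost expects for multi-class objectives:

# split the combined, encoded frame back into train and test
# (assumes df_all stacked the training rows first, as in step 2)
X <- df_all_combined[1:nrow(df_train), ]
X_test <- df_all_combined[(nrow(df_train) + 1):nrow(df_all_combined), ]

# xgboost expects multi-class labels as integers 0 .. (num_class - 1)
y <- as.integer(as.factor(labels$labels)) - 1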

4. Now train and run the model:

xgb <- xgboost(data = data.matrix(X[, -1]),   # drop the first column (assumed to be an ID)
               label = y,
               eta = 0.1,                     # learning rate
               max_depth = 15,                # maximum tree depth
               nrounds = 25,                  # number of boosting iterations
               subsample = 0.5,               # fraction of rows sampled per tree
               colsample_bytree = 0.5,        # fraction of columns sampled per tree
               seed = 1,                      # ignored by the R package; set.seed() governs
               eval_metric = "merror",        # multi-class classification error
               objective = "multi:softprob",  # emit a probability for every class
               num_class = 12,                # number of target classes
               nthread = 3                    # CPU threads to use
)
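
Before settling on these values, it can be worth cross-validating; xgb.cv() takes the same parameters and reports per-round error across folds. A sketch with 5 folds:

# optional: 5-fold cross-validation with the same parameters
cv <- xgb.cv(data = data.matrix(X[, -1]), label = y,
             eta = 0.1, max_depth = 15, nrounds = 25,
             subsample = 0.5, colsample_bytree = 0.5,
             eval_metric = "merror", objective = "multi:softprob",
             num_class = 12, nfold = 5)
print(cv)   # per-round train/test merror, to spot over- or under-fitting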

5. The last step is to score the test population:
You now have an object "xgb", a trained xgboost model. This is how to score the test population:

# predict values in the test set
y_pred <- predict(xgb, data.matrix(X_test[, -1]))
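
Because the objective is "multi:softprob", predict() returns the class probabilities as a single flat vector, num_class values per row; reshape it to recover per-row classes (a short sketch):

# reshape the flat probability vector into a rows-by-classes matrix
pred_matrix <- matrix(y_pred, ncol = 12, byrow = TRUE)
# pick the most probable class, converting back to 0-based labels
pred_class <- max.col(pred_matrix) - 1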

Parameters that can be used in xgboost:

There are 3 types of parameters (a sketch of how they fit together follows this list):

1. General parameters – control the overall setup, such as whether the booster used is a tree or a linear model.
2. Booster parameters – depend on which booster has been chosen.
3. Learning task parameters – define the learning scenario, such as the objective for regression or classification tasks.
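
A minimal sketch of how the three groups come together in a call to xgb.train(); the values here are illustrative, not tuned:

# the three parameter groups, passed together as a single list
params <- list(
  booster = "gbtree",            # general parameter: tree-based boosting
  eta = 0.1,                     # booster parameter: learning rate
  max_depth = 6,                 # booster parameter: tree depth
  objective = "multi:softprob",  # learning task parameter: objective
  eval_metric = "merror",        # learning task parameter: evaluation metric
  num_class = 12
)
dtrain <- xgb.DMatrix(data = data.matrix(X[, -1]), label = y)
xgb_model <- xgb.train(params = params, data = dtrain, nrounds = 25)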

Advanced functionality of xgboost:

Implementing xgboost is really simple compared to other machine learning techniques. We already have a model, as shown above.  

Now let's find the variable importance in the model and subset our variable list:

# let's start by looking at what the actual trees look like
model <- xgb.dump(xgb, with_stats = TRUE)   # older versions use "with.stats"
model[1:10]   # prints the top 10 nodes of the model

# get the real feature names
names <- dimnames(data.matrix(X[, -1]))[[2]]

# compute the feature importance matrix
importance_matrix <- xgb.importance(names, model = xgb)

# nice graph of the top 10 features
xgb.plot.importance(importance_matrix[1:10, ])

# in case the last step does not work for you because of a version issue,
# a plain barplot of the gain column is a workable fallback
barplot(importance_matrix$Gain, names.arg = importance_matrix$Feature, las = 2)

As you can see, many of the variables are not worth keeping in the model. We are free to remove them and run the model again, which often improves accuracy; a pruning sketch follows.
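
A minimal sketch of that pruning step, keeping only the highest-gain features (the cutoff of 20 is arbitrary):

# keep the 20 highest-gain features and refit
top_feats <- head(importance_matrix$Feature, 20)
X_pruned <- X[, colnames(X) %in% top_feats]
xgb_pruned <- xgboost(data = data.matrix(X_pruned), label = y,
                      eta = 0.1, max_depth = 15, nrounds = 25,
                      eval_metric = "merror", objective = "multi:softprob",
                      num_class = 12)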

Testing whether the result makes sense:

Suppose age turns out to be the most important variable. A simple chi-square test is a quick way to check whether the variable is genuinely associated with the target:

# chi-square test of Age against the target
# (output_vector is assumed to be the categorical target for the train set)
test <- chisq.test(train$Age, output_vector)
print(test)

The same check can be repeated for the other important variables, to verify that the model has picked up genuinely informative features; a loop sketch follows.
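
A short sketch of that loop over the top features, assuming the feature columns live in X and output_vector is the target, as above:

# repeat the chi-square sanity check for the top 5 features
for (f in head(importance_matrix$Feature, 5)) {
  cat("\n---", f, "---\n")
  print(chisq.test(X[[f]], output_vector))
}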
