Hi,
Many of you have probably converted numeric variables to categorical ones with the cut function on the training set. Once the decision tree is built, it is time to run it on the test data set, but the test data still holds the continuous measures. Binning the whole data set before splitting is not a good way around this: the bin boundaries would then be computed from the test data as well, so the model has effectively seen the test set.
Below is an R function I have written to bin a continuous variable based on an already binned variable. It needs the stringr package and takes the levels of the binned training variable (levels) and the continuous test values (measure):
library(stringr)  # provides str_match(), str_sub() and str_length()

bin.test.measures <- function(levels, measure){
  # Parse the lower and upper bound out of each cut() label, e.g. "(10,20]"
  level.frame <- data.frame(level = character(0), lower = numeric(0),
                            upper = numeric(0), stringsAsFactors = FALSE)
  for(level in levels){
    lowermatch <- str_match(level, pattern = '[\\(\\[].+,')
    lower <- as.numeric(str_sub(lowermatch, 2, str_length(lowermatch)-1))
    uppermatch <- str_match(level, pattern = ',.+[\\)\\]]')
    upper <- as.numeric(str_sub(uppermatch, 2, str_length(uppermatch)-1))
    temp <- data.frame(level, lower, upper, stringsAsFactors = FALSE)
    level.frame <- rbind(level.frame, temp)
  }
  # Start from NA so the result always has one entry per input value;
  # values below the first bin (and missing values) stay NA
  binned <- rep(NA_character_, length(measure))
  last <- nrow(level.frame)
  for(j in seq_along(measure)){
    number <- measure[j]
    if(is.na(number)) next
    for(i in seq_len(last)){
      lower <- level.frame$lower[i]
      upper <- level.frame$upper[i]
      if(i == 1){
        # First bin also includes its lower bound
        hit <- number >= lower && number <= upper
      } else if(i < last){
        hit <- number > lower && number <= upper
      } else {
        # Last bin is open-ended: values above the training maximum land here
        hit <- number > lower
      }
      if(hit) binned[j] <- level.frame$level[i]
    }
  }
  binned
}
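For comparison, here is a minimal sketch of another way to get the same effect with base R only: recover the break points from the training labels and hand them back to cut for the test data. The helper name breaks.from.levels and the toy train/test vectors are made up for illustration and are not part of the function above.

```r
# Hypothetical helper (not from the original post): recover the numeric
# break points from cut()-style labels such as "(0,10]".
breaks.from.levels <- function(levels) {
  # Drop the brackets, split each label on the comma, collect every bound
  bounds <- strsplit(gsub("[\\[\\]()]", "", levels, perl = TRUE), ",")
  sort(unique(as.numeric(unlist(bounds))))
}

# Toy data, assumed for this example only
train <- c(1, 5, 12, 18, 25, 30)
test  <- c(3, 10, 29)

# Bin the training data with explicit breaks
train.binned <- cut(train, breaks = c(0, 10, 20, 30))

# Reuse the training bins on the test data
breaks <- breaks.from.levels(levels(train.binned))
test.binned <- cut(test, breaks = breaks, labels = levels(train.binned))

print(as.character(test.binned))
# "(0,10]"  "(0,10]"  "(20,30]"
```

Two caveats with this sketch. The labels printed by cut are rounded (see its dig.lab argument), so break points recovered from them are only approximate when cut chose the breaks itself; if you still have the original breaks vector, passing that to cut directly is safer. Also, cut returns NA for test values outside the training range, whereas the function above puts anything above the last lower bound into the last bin.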
Hope this helps!

