Hi,
It has been a while since I last posted anything. If you know me, you know that I rant about loops in R. Vectorization was everything to me, and I frequently avoided using loops in R.
On a Kaggle challenge, I had to process audio files. Feature extraction on these audio files gave me time series, and comparing two large time series is a CPU-intensive task. For my code, which was doing distance optimization, vectorization just didn't cut it: it was taking far too long.
Finally, I read about parallel processing using foreach loops, and implementing them made my optimization steps run quickly; the parallel loops ran faster than the vectorized code I had written. The results were a relief. Then I read about doMC, which lets you use the multi-core capability of your CPU, only to find that doMC is not available for Windows. Fortunately, doSNOW does the same thing on Windows, and it improved my times even more.
For my distance-optimization code, the numbers are below:
Vectorization (sapply) -> 30 mins
foreach loops only -> 22 mins
doSNOW + foreach loops -> 18 mins
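If you want to run a comparison like this yourself, system.time makes it easy. Below is a toy sketch (synthetic data, not my original Kaggle code) that times a sapply version against a foreach + doSNOW version of the same distance computation; on a task this small the parallel overhead may dominate, so the gains only show up on heavier workloads:

```r
library(doSNOW)
library(foreach)

# toy stand-in for time-series comparison: distance of one query
# vector against every row of a random matrix
set.seed(42)
mat   <- matrix(rnorm(200 * 1000), nrow = 200)
query <- rnorm(1000)

dist.one <- function(row, q) sqrt(sum((row - q)^2))

# sapply version
t.sapply <- system.time(
  d1 <- sapply(seq_len(nrow(mat)), function(i) dist.one(mat[i, ], query))
)

# foreach + doSNOW version on 2 workers
cl <- makeCluster(2)
registerDoSNOW(cl)
t.par <- system.time(
  d2 <- foreach(i = seq_len(nrow(mat)), .combine = c) %dopar%
          dist.one(mat[i, ], query)
)
stopCluster(cl)

all.equal(d1, d2)  # both approaches give identical distances
```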
Below is a snippet of code that uses foreach and doSNOW:
# load libraries
library(doSNOW)
library(foreach)

# initiate a cluster; my machine has a dual-core processor
my.clusters <- makeCluster(2)
registerDoSNOW(my.clusters)

# run the loop iterations in parallel across both cores independently:
# here we train 16 rpart trees, distributed over two separate cores of my CPU
rpart.trees <- foreach(i = 1:16, .packages = 'rpart') %dopar% {
  data.train <- rbind(train_set1sep[which(train_set1sep$row.sep == i), -c(11)],
                      train_set0)
  rpart(ACTION ~ ., data.train)
}

# release the workers when done
stopCluster(my.clusters)
PS: Remember that this methodology only works for code whose iterations can be executed independently in parallel. Think Map-Reduce with a data replication factor of 1.
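That Map-Reduce analogy can be made concrete with foreach's .combine argument: each iteration is the "map" step over its own chunk of data, and .combine reduces the results. A minimal sketch on toy data (not from the Kaggle code):

```r
library(doSNOW)
library(foreach)

cl <- makeCluster(2)
registerDoSNOW(cl)

# "map": each worker sums the squares of its own chunk only
# (replication factor 1 -- no chunk is touched twice);
# "reduce": the partial sums are combined with '+'
chunks <- split(1:100, rep(1:4, each = 25))
total  <- foreach(ch = chunks, .combine = '+') %dopar% sum(ch^2)

stopCluster(cl)
total  # same as sum((1:100)^2) = 338350
```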