Unzipping a humongous number of files using R

Hi,

Assume you have this situation:

[image: zip files]

There are 400 such zip files and you want to unzip them. Imagine you are in an R session and do not want to exit R just for this. How do we achieve this? Can we do this quickly and efficiently? Yes, we can.

A combo of the foreach, doSNOW/doMC, and utils libraries will do the trick.

rm(list=ls())

# setting up working directory
setwd('/root/Desktop/multi-modal-train')

# reading the list of all zip files
# (include.dirs only applies to recursive listings; a pattern match
#  keeps directories and other files out of the list)
zip.files <- list.files('/root/Desktop/multi-modal-train', pattern = '\\.zip$')
zip.files

# loading required libraries
library(foreach)
# on Windows, use doSNOW instead (socket cluster)
# library(doSNOW)
# c1 <- makeCluster(1)
# registerDoSNOW(c1)

# on Linux/macOS, doMC forks worker processes
library(doMC)
registerDoMC()

# setting operation variables
base.dir <- '/root/Desktop/multi-modal-train'
i <- 1

# parallel loop for faster processing:
# unzipping the files can proceed in parallel
# (the original 1:(length(zip.files)-1) silently skipped the last file)
foreach(i = seq_along(zip.files)) %dopar% {
  unzip(file.path(base.dir, zip.files[i]), exdir = 'unzipped')
  cat('unzipping', zip.files[i], '\n')
  gc(reset = TRUE)
}

# Windows fork
# stopCluster(c1)
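As an aside (not part of the original snippet): foreach returns one result per iteration, so the same loop can also hand back the paths of the extracted files. A minimal sketch, assuming the same base.dir and zip.files as above, using .combine to flatten the per-iteration results into one character vector:

extracted <- foreach(i = seq_along(zip.files), .combine = c) %dopar% {
  unzip(file.path(base.dir, zip.files[i]), exdir = 'unzipped')
}
head(extracted)

This works because unzip() returns the paths of the files it extracted.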

The output is below:

[image: final output]

Happy R’ing

Parallel Looping, Multi-Core processing in R

Hi,

It has been a while since I last posted anything. If you know me, you know that I rant about loops in R. Vectorization was everything to me, and I frequently avoided using loops in R.

On a Kaggle challenge, I had to process audio files. Feature extraction on these audio files gave me time series, and comparing two large time series is a CPU-intensive task. For my code, which was doing distance optimization, vectorization just didn't cut it: it was taking too long.

Finally, I read about parallel processing using foreach loops, and implementing them made my optimization steps run quickly. The parallel loops ran quicker than the vectorized code I had written, which was a relief. Then I read about doMC, which helps utilize the multi-core processing ability of your CPU. I found that doMC is not available for Windows, but fortunately doSNOW does the same thing there. This improved my time even more.

For my code (distance optimization), the numbers are below:

Vectorization (sapply) -> 30 mins

foreach loops only -> 22 mins

doSNOW + foreach loops -> 18 mins

Below is a snippet of code that employs foreach and doSNOW:


# loading libraries
library(doSNOW)
library(foreach)

# initiating cores; my
# machine has a dual-core processor
my.clusters <- makeCluster(2)
registerDoSNOW(my.clusters)
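# (a side note, not in the original post) rather than hard-coding 2,
# you can size the cluster to however many cores the machine reports:
# my.clusters <- makeCluster(parallel::detectCores())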

i <- 0

# running parallel loops on both cores independently:
# here we are training 16 trees in parallel, on two separate cores of my CPU
rpart.trees <- foreach(i = 1:16, .packages = 'rpart') %dopar% {
  data.train <- rbind(train_set1sep[which(train_set1sep$row.sep == i), -c(11)], train_set0)
  rpart(ACTION ~ ., data.train)
}

# releasing the cluster when done
stopCluster(my.clusters)

PS: Remember that this methodology works only for code whose iterations are independent and can be executed in parallel. Think MapReduce with a data replication factor of 1.
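To make the MapReduce analogy concrete, here is a minimal sketch (not from the original post): each iteration "maps" over an independent chunk of the data, and .combine "reduces" the per-chunk results. It uses the sequential %do% so it runs without a registered backend; swap in %dopar% once one is registered.

library(foreach)

# four independent chunks of 25 numbers each ("map" inputs)
chunks <- split(1:100, rep(1:4, each = 25))

# sum each chunk in its own iteration, then reduce with '+'
total <- foreach(chunk = chunks, .combine = '+') %do% {
  sum(chunk)
}
total  # 5050, same as sum(1:100)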