Converting to Transactions in R for arules library

Hi All,

It has been a long time; today I will share a simple snippet that helped me with association rule mining in R. Typically, when you extract transaction data from a database, it is not at the transaction level; it is at the item level.

[screenshot: sample of the extracted data]

But the way any self-respecting data modeller will store it in an RDBMS is:

[screenshot: the table as stored in the RDBMS]

And the arules library in R requires the data in a “transactions” class, which is essentially a (sparse) binary matrix. So, below is the snippet to convert the tabular list to a matrix and then to the transactions class;


# load the required libraries
library(reshape2)  # provides acast()
library(arules)

# converts the item-level data frame to an account x product matrix
train.mat <- acast(train, account ~ product, value.var = "count")

# zero all NAs (vectorised; no nested loops needed)
train.mat[is.na(train.mat)] <- 0

# the transactions class expects a binary incidence matrix,
# so record only whether an account bought a product at all
train.mat <- train.mat > 0

# converts the matrix to the transactions class
train.trans <- as(train.mat, "transactions")
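Once you have the transactions object, mining rules is a one-liner with apriori(). A minimal, self-contained sketch; the toy baskets and the support/confidence thresholds below are made up for illustration, and the real train.trans works the same way:

```r
library(arules)

# a toy basket list standing in for train.trans
baskets <- list(c("bread", "milk"),
                c("bread", "butter"),
                c("bread", "milk", "butter"))
trans <- as(baskets, "transactions")

# mine rules; the thresholds here are illustrative only
rules <- apriori(trans, parameter = list(supp = 0.5, conf = 0.8))
inspect(rules)
```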

Hope this helps!!!

Unzipping a humungous number of files using R

Hi,

Assume you have this situation:

[screenshot: a folder full of .zip files]

There are 400 such zip files and you want to unzip them. Imagine you are in an R session and do not want to exit R just for this. How do we achieve this? Can we do this quickly and efficiently? Yes, we can.

A combo of foreach, doSNOW/doMC and utils libraries will do the trick.

rm(list = ls())

# setting up working directory
setwd('/root/Desktop/multi-modal-train')

# reading the list of zip files (and only zip files)
zip.files <- list.files('/root/Desktop/multi-modal-train',
                        pattern = '\\.zip$', include.dirs = FALSE)
zip.files

# loading required libraries
library(foreach)
# Windows fork
# library(doSNOW)
# c1 <- makeCluster(1)
# registerDoSNOW(c1)

# Linux fork
library(doMC)
registerDoMC()

# setting operation variables
base.dir <- '/root/Desktop/multi-modal-train'

# parallel loop for faster processing;
# unzipping different files can proceed in parallel
foreach(i = 1:length(zip.files)) %dopar% {
  unzip(paste(base.dir, zip.files[i], sep = '/'), exdir = 'unzipped')
  cat(paste('unzipping', zip.files[i], sep = '-'), '\n')
}

# Windows fork
# stopCluster(c1)

The output is below:

[screenshot: console output of the unzip loop]

Happy R’ing

Parallel Looping, Multi-Core processing in R

Hi,

It has been a while since I last posted anything. If you know me, you know that I rant about loops in R. Vectorization was everything to me, and I frequently avoided loops in R.

On a Kaggle challenge, I had to process audio files. Feature extraction of these audio files gave me time series, and comparing two large time series is a CPU-intensive task. For my code, which was doing distance optimization, vectorization just didn’t cut it; it was taking far too long.

Finally, I read about parallel processing using foreach loops, and implementing them made my optimization steps run quickly. The parallel loops ran faster than the vectorized code I had written; the results were a relief. Then I read about doMC, which helps utilize the multiple cores of your CPU. doMC is not available for Windows, but fortunately doSNOW does the same thing there. This improved my timing even more.

For my code (which does distance optimization), the numbers are below;

Vectorization(sapply) -> 30 mins

only Foreach loops -> 22 mins

doSNOW + Foreach loops -> 18 mins

Below is a snippet of code that employs Foreach and doSNOW;


# loading libraries
library(doSNOW)
library(foreach)

# initiating cores; my machine has a dual-core processor
my.clusters <- makeCluster(2)
registerDoSNOW(my.clusters)

# running parallel loops on both cores independently;
# here we are training 16 trees in parallel, on two separate cores of my CPU
rpart.trees <- foreach(i = 1:16, .packages = 'rpart') %dopar% {
  data.train <- rbind(train_set1sep[which(train_set1sep$row.sep == i), -c(11)],
                      train_set0)
  rpart(ACTION ~ ., data.train)
}

# release the workers when done
stopCluster(my.clusters)

PS: Remember that this methodology only works for code whose iterations can be executed independently in parallel. Think Map-Reduce with a data replication factor of 1.

Reading humungous csv files in R

Hi,

The other day, for Kaggle, I had to read a CSV file into R. It was around 4 GB in size. None of my text editors could open the file; even the Access text import failed. I was about to resort to a macro, or to an open-source RDBMS like MySQL. Luckily, my teammate gave me the idea of using file stream readers and writers as in Java, and my manager showed me this link.

I didn’t need to read the entire file; for initial experiments I needed only a part of it. So, below is a code snippet to get a part of the CSV;

# initializing file connections
x <- file('extra_unsupervised_data.csv', 'rt')
x
y <- file('unsupervised_trim.csv', 'wt')
y

# reading the header plus the first 10,000 data lines
line <- readLines(x, n = 10001)

# writing the lines through the output connection
writeLines(line, con = y)

# closing connections
close(x)
close(y)
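If you later need the rest of the file, the same connection-based approach extends to processing it in chunks, so nothing ever has to fit in memory at once. A sketch, with an arbitrary chunk size and a placeholder where your per-chunk work would go:

```r
# process a huge csv chunk by chunk through one open connection;
# the chunk size of 100000 lines is arbitrary
con <- file('extra_unsupervised_data.csv', 'rt')
n.lines <- 0
repeat {
  chunk <- readLines(con, n = 100000)
  if (length(chunk) == 0) break   # end of file reached
  # ...do your per-chunk processing here...
  n.lines <- n.lines + length(chunk)
}
close(con)
n.lines
```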

May this snippet save you some time!!!!!

R SMOTE Function – Reminder

SMOTE is a wonderful function in R. It belongs to the DMwR package, and it does something very useful: if your data is imbalanced, SMOTE generates synthetic examples of the minority class using nearest neighbours (the number of neighbours is an argument). There’s one catch though: it cannot work with multi-class data. It works only with two classes. You will need to do some special data transformations to SMOTE multi-class data.

Try testing your multi-class data with the syntax below;

train_cl1s<-SMOTE(V1 ~ ., train_cl1, perc.over = 1500, perc.under = 0, k=11)

Now check the resulting data frame. It will consist of only one class and no more, since SMOTE recognizes only two classes (and with perc.under = 0, none of the other examples are kept).
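One way to work around the two-class limitation is to SMOTE one minority class at a time against "everything else" and combine the results. This is only a sketch under my own assumptions: the helper name is mine, DMwR must be installed, and your column names will differ.

```r
library(DMwR)

# recode the target to "this class vs everything else", SMOTE it,
# and return the oversampled frame; 'target' is a column name and
# 'cls' is one level of that column
smote.one.vs.rest <- function(data, target, cls, ...) {
  tmp <- data
  tmp[[target]] <- factor(ifelse(tmp[[target]] == cls,
                                 as.character(cls), "other"))
  SMOTE(as.formula(paste(target, "~ .")), tmp, ...)
}

# e.g. oversample one class of a multi-class target, one class at a time
# train_cl3s <- smote.one.vs.rest(train_cl1, "V1", "3",
#                                 perc.over = 1500, perc.under = 0, k = 11)
```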

Repeating Numbers in R without Loop structures

Hello,

The other day, I needed to repeat values based on a sequence of other values. For example, an array a, b, c needed to be repeated according to another array 3, 2, 1, and the output needed to be a, a, a, b, b, c.

I started my R code with "for"... then suddenly a booming voice echoed, "If you are using 'for' in R, you aren't using it properly". So I started thinking about how to do this without a loop;

Firstly, the code to repeat values based on a sequence of numbers using loops;

base<-c("a","b","c")
repetition<-c(3,2,1)
i<-1
x<-1:length(repetition)
new<-0
for(i in x){
new<-c(new,rep(base[i],repetition[i]))
i=i+1
}
new<-new[-c(1)]
new

Finally, the code to repeat values without a loop. It is much cleaner, shorter and sweeter;

base<-c("a","b","c")
repetition<-c(3,2,1)
x<-data.frame(cbind(base,repetition))
x$repetition<-as.numeric(x$repetition)
rep_final<-rep(x$base,x$repetition)
rep_final

You can also use the na.locf function from the zoo library for such cases. Say NO to loops!!!!!!!
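For completeness, here is what the zoo route looks like. na.locf() carries the last non-NA observation forward, which handles the related class of "fill down" problems; the price vector below is made up for illustration:

```r
library(zoo)

# carry the last observation forward over the NA gaps
prices <- c(100, NA, NA, 103, NA, 105)
na.locf(prices)   # 100 100 100 103 103 105
```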

Text Mining – Data Preprocessing using R – Part 1

For text mining projects, the most important and time-consuming phase is data pre-processing. In this phase, you cleanse, validate and transform the training data for your machine learning algorithm. Improper pre-processing leads to reduced accuracy, over-fitting/under-fitting, and bias/variance problems in the algorithms. All the pre-processing done on the training data must also be applied to the test data and the incoming online data, to make sure the data arrives in the same form the algorithms can consume. There are many steps in data pre-processing, and not all are needed in every project. For example, spell checking may not be needed for machine-generated data (assuming the machine makes no mistakes), part-of-speech tagging may not be needed for document genre classification, and abbreviation handling may not be needed in certain cases. Below are some of the text pre-processing steps with relevant R code.

1. Data Loading;

Data can be loaded into R from many sources; functions for reading files and accessing database tables are available in R libraries.

#Loading source data
source_data<-read.csv("source.csv",header=FALSE)

2. Text Normalization;

In the example of review classification, text like "3 stars", "5 stars", "2 star" can be normalized to something like ‘good’, ‘bad’ or ‘average’ before tokenization or bag-of-words. Another example: words like ‘haven’t’, ‘shouldn’t’ etc. can be reduced to ‘not’. This reduces the token count and the complexity of handling such tokens. Tokenizing "2.5 stars" would give you two tokens, 2.5 and stars, which may be interpreted wrongly by the machine learning algorithm.

source_data <- sapply(source_data,
                      function(x) gsub("4 stars", "good", x))
source_data <- sapply(source_data,
                      function(x) gsub("1 star", "bad", x))

# replacing words with n't
# not.txt is a file with words that can be replaced with not
not <- read.csv("c:/not.txt", header = FALSE, sep = "\n")
repNot <- function(x) {
  # substitute each word from the lookup, not a hard-coded one
  for (i in 1:nrow(not)) {
    x <- gsub(not[i, 1], "not", x)
  }
  return(x)
}
source_data <- repNot(source_data)

3. Abbreviations;

Sometimes handling abbreviations makes sense: certain scenarios need expansion of abbreviations, others need creation of them. Whatever the case, the snippet below can be used either way by just reversing the abbreviations and expansions.

# setting up an abbreviation lookup
# reverse the columns for the expansion logic
abbrev <- c("lol", "omg", "imo", "gtfo")
abbrev <- cbind(abbrev, c("laugh out loud", "oh my god",
                          "in my opinion", "get the out"))
abbrev

# function for expanding abbreviations
findAbbrev <- function(x) {
  for (i in 1:nrow(abbrev)) {
    x <- gsub(abbrev[i, 1], abbrev[i, 2], x)
  }
  return(x)
}

text <- "Lol, OMG, IMO this is not a drill, haven't you heard"
text <- rbind(text,"LOL, Am I wrong, GTFO")
text<- Corpus(VectorSource(text),
readercontrol=list(reader=readPlain))
text <- tm_map(text , tolower)
text <- tm_map(text , removePunctuation)
text

# text<-sapply(text,function(x) findAbbrev(x))
text<-findAbbrev(text)
text

To be continued…

Building Hellinger Distance Decision Trees

If you have stumbled on this page, I assume there is still no R package for Hellinger distance decision trees.

When I wanted to build a model using a Hellinger tree, my natural instinct was to search for an R package. When I found none, I searched for an implementation in any other language and couldn’t find any snippets either. So here is my post, which could save you valuable time.

So, using this jar, I updated the base Weka jar and found the Hellinger tree under the Classify tab in Weka Explorer, but somehow couldn’t work with it, maybe because I was new to Weka. I didn’t have the time to learn Weka, and I needed to build a model with a Hellinger distance decision tree ASAP, so I used the jar to build the tree in our old dependable friend, Java.

Weka classifiers work best with .arff files. Data can be supplied to them in other formats, but arff files are native to Weka. We can use the RWeka package in R to export training, test and other datasets as .arff files.

library(RWeka)
write.arff(data_train,"train.arff",eol = "\n")

Now for the sample Java code for a Hellinger tree based model;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.HTree;
import weka.core.Instances;

import java.io.BufferedReader;
import java.io.FileReader;

public class hellingerDT {

    public static void main(String[] args) throws Exception {

        // Reading data from the training, test and classify arff files
        BufferedReader breader = new BufferedReader(new FileReader("location/train.arff"));
        Instances train = new Instances(breader);
        // Setting the first column as the class variable, indicating to the
        // model that this is the attribute to predict
        train.setClassIndex(0);
        breader.close();

        breader = new BufferedReader(new FileReader("location/test.arff"));
        Instances test = new Instances(breader);
        test.setClassIndex(0);
        breader.close();

        breader = new BufferedReader(new FileReader("location/classify.arff"));
        Instances classify = new Instances(breader);
        classify.setClassIndex(0);
        breader.close();

        // Instantiate a Hellinger tree model
        HTree hT = new HTree();

        // Train the model
        hT.buildClassifier(train);

        // Evaluate the model using the test data
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(hT, test);
        // Display the metrics
        System.out.println(eval.toSummaryString("Review Classification Hellinger Tree", true));
        System.out.println("Precision " + eval.precision(1) * 100
                + " and Recall " + eval.recall(1) * 100);

        // Print the tree
        System.out.println(hT.graph());

        // Classify new data
        for (int i = 0; i < classify.numInstances(); i++) {
            double pred = hT.classifyInstance(classify.instance(i));
            System.out.println(pred);
        }
    }
}