Binning Measure in Test Dataset based on Training Dataset Levels

Hi,

Many of you might have converted numeric variables to categorical ones using the cut function on the training set. Once the decision tree is built, it is time to run it on the test data set, but the test data set still has continuous measures. Binning the data before splitting is not a good fix either: your model would then have seen the test data, which leaks information.

Below is an R function I have written to bin a continuous variable based on the levels of an already binned variable:


# str_match / str_sub / str_length below come from the stringr package
library(stringr)

# levels:  factor levels produced by cut(), e.g. "(0,10]"
# measure: the continuous test-set variable to bin
bin.test.measures <- function(levels, measure){
  level.frame <- data.frame(level = character(0),
                            lower = numeric(0),
                            upper = numeric(0))
  # parse the lower and upper bounds out of each interval label
  for(level in levels){
    lowermatch <- str_match(level, pattern = '[\\(\\[].+,')
    lower <- as.numeric(str_sub(lowermatch, 2, str_length(lowermatch) - 1))
    uppermatch <- str_match(level, pattern = ',.+[\\)\\]]')
    upper <- as.numeric(str_sub(uppermatch, 2, str_length(uppermatch) - 1))
    level.frame <- rbind(level.frame,
                         data.frame(level = level, lower = lower, upper = upper))
  }
  binned <- character(0)
  for(number in measure){
    for(i in 1:nrow(level.frame)){
      if(i == 1){
        # the first bin includes its lower bound
        if(number >= level.frame[i, 2] && number <= level.frame[i, 3]){
          binned <- c(binned, as.character(level.frame[i, 1]))
        }
      } else if(i < nrow(level.frame)){
        if(number > level.frame[i, 2] && number <= level.frame[i, 3]){
          binned <- c(binned, as.character(level.frame[i, 1]))
        }
      } else {
        # the last bin is treated as open-ended above, so test values
        # beyond the training range still get a level
        if(number > level.frame[i, 2]){
          binned <- c(binned, as.character(level.frame[i, 1]))
        }
      }
    }
  }
  binned
}
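As an aside: if you still have the numeric break points you used on the training set, a simpler route is to pass the very same breaks to cut on the test set, which guarantees identical levels. A quick sketch with made-up vectors (train.x and test.x are just examples):

```r
# made-up training and test vectors
train.x <- c(2, 7, 11, 15, 19, 24)
test.x  <- c(3, 12, 23)

# bin the training data and remember the breaks
breaks <- quantile(train.x, probs = seq(0, 1, 0.25))
train.binned <- cut(train.x, breaks = breaks, include.lowest = TRUE)

# apply the identical breaks (and hence identical levels) to the test data
test.binned <- cut(test.x, breaks = breaks, include.lowest = TRUE)
identical(levels(train.binned), levels(test.binned))  # TRUE
```

Test values outside the training range come back as NA with this approach, which you may actually prefer to silently assigning them to the last bin.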

Hope this helps!!!

Unzipping a humongous number of files using R

Hi,

Assume you have this situation: a directory full of zip files.


There are 400 such zip files and you want to unzip them. Imagine you are in an R session and do not want to exit R just for this. How do we achieve this? Can we do this quickly and efficiently? Yes, we can.

A combo of foreach, doSNOW/doMC and utils libraries will do the trick.

rm(list=ls())

# setting up working directory
setwd('/root/Desktop/multi-modal-train')

# reading the list of all zip files
zip.files <- list.files('/root/Desktop/multi-modal-train',
                        pattern = '\\.zip$', include.dirs = FALSE)
zip.files

# loading required libraries
library(foreach)
# Windows fork
# library(doSNOW)
# c1 <- makeCluster(1)
# registerDoSNOW(c1)

# linux fork
library(doMC)
registerDoMC()

# setting operation variables
base.dir <- '/root/Desktop/multi-modal-train'

# parallel loop for faster processing;
# unzipping the files can proceed in parallel
foreach(i = seq_along(zip.files)) %dopar% {
  unzip(paste(base.dir, zip.files[i], sep = '/'), exdir = 'unzipped')
  cat(paste('unzipping', zip.files[i], sep = '-'))
  gc(reset = TRUE)
}

# Windows fork
# stopCluster(c1)

The output is a stream of 'unzipping-<file>' messages, one per zip file, and the extracted contents land in the 'unzipped' directory.

Happy R’ing

Reading humongous csv files in R

Hi,

The other day, for a Kaggle competition, I had to read a csv file into R. It was around 4GB in size. None of the text editors could open the file, and even the Access text import failed. I was about to resort to a macro, or to an open-source RDBMS like MySQL. Luckily, my teammate gave me the idea of using file stream readers and writers in Java, and my manager showed me this link.

I didn’t need to read the entire file; for initial experiments I needed only a part of it. So, below is a code snippet to extract part of the csv:

# initializing file connections
x <- file('extra_unsupervised_data.csv', 'rt')
y <- file('unsupervised_trim.csv', 'wt')

# reading the first 10001 lines (header plus 10000 rows)
line <- readLines(x, n = 10001)

# writing the lines through the open output connection
# (cat(..., fill=TRUE) would re-wrap long lines; writeLines preserves them)
writeLines(line, con = y)

# closing the connections
close(x)
close(y)
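If even the trimmed file is too big to hold in memory at once, the same connection trick can process the file in bounded chunks. A sketch in base R, where the file names and the chunk size are just placeholders:

```r
# process a huge text/csv file in fixed-size chunks so memory stays bounded
process.in.chunks <- function(infile, outfile, chunk.size = 5000) {
  con.in  <- file(infile, 'rt')
  con.out <- file(outfile, 'wt')
  total <- 0
  repeat {
    lines <- readLines(con.in, n = chunk.size)
    if (length(lines) == 0) break          # end of file reached
    # ... any per-chunk filtering or transformation would go here ...
    writeLines(lines, con = con.out)
    total <- total + length(lines)
  }
  close(con.in)
  close(con.out)
  total                                    # number of lines processed
}
```

Only chunk.size lines ever sit in memory at a time, so a 4GB file passes through in small, manageable pieces.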

May this snippet save you some time!!!!!

R SMOTE Function – Reminder

SMOTE is a wonderful function in R. It belongs to the DMwR package and does something very useful: it helps balance under-represented classes. If your data is imbalanced, SMOTE generates synthetic examples of the minority class using its k nearest neighbours, where k is supplied as an argument. There is one catch though: it cannot work with multi-class data. It works only with two classes, so you will need some special data transformations to SMOTE a multi-class data set.

Try testing your multi class data with the below syntax;

train_cl1s<-SMOTE(V1 ~ ., train_cl1, perc.over = 1500, perc.under = 0, k=11)

Now check the resultant data frame: it will consist of only one class and nothing more, because SMOTE recognizes only two classes.
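One common workaround (my sketch, not something from the DMwR documentation) is one-vs-rest recoding: relabel every class except the one you want to oversample as "other", SMOTE the resulting two-class frame, and repeat per class. The recoding step alone, with a made-up data frame:

```r
# made-up multi-class training frame
train <- data.frame(V1 = factor(c("a", "b", "c", "a", "b")),
                    x  = c(1, 2, 3, 4, 5))

# collapse every class except the target into "other",
# leaving a two-class problem that SMOTE can handle
to.two.class <- function(df, target) {
  df$V1 <- factor(ifelse(df$V1 == target, target, "other"))
  df
}

train_cl1 <- to.two.class(train, "a")
levels(train_cl1$V1)  # "a" "other"
```

train_cl1 can then go into the SMOTE call shown above; repeat with each class as the target to rebalance the whole data set.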

Repeating Numbers in R without Loop structures

Hello,

The other day, I needed to repeat values based on a sequence of other values; e.g., an array of a, b, c needed to be repeated according to another array 3, 2, 1, and the output needed to be a, a, a, b, b, c.

I started my R code with “for”… then suddenly a booming voice echoed: “If you are using for in R, you aren’t using it properly”. So I started thinking about how to do this without a loop.

Firstly, the code to repeat values based on a sequence of numbers using loops;

base <- c("a", "b", "c")
repetition <- c(3, 2, 1)
new <- character(0)
# the for loop already advances i; no manual i = i + 1 needed
for(i in seq_along(repetition)){
  new <- c(new, rep(base[i], repetition[i]))
}
new

Finally, the code to repeat values without a loop. The code is much cleaner, shorter and sweeter;

base <- c("a", "b", "c")
repetition <- c(3, 2, 1)
# rep() is vectorized over 'times': each element of base is repeated
# the corresponding number of times. No data.frame detour needed; on
# older R the cbind/data.frame route turned the numbers into factors,
# making as.numeric() unsafe.
rep_final <- rep(base, times = repetition)
rep_final

You can also use the na.locf function from the zoo library for such cases. Say NO to loops!

HBase 101 and Tutorial

HBase is a Hadoop ecosystem component: a column-oriented (NoSQL) database that uses in-memory processing to add quick-read and quick-write capability to the write-once-read-many rigidity of Hadoop. Like any other columnar database, HBase uses a row identifier and column families.

The most basic unit of an RDBMS is a tuple; an RDBMS table is a collection of tuples. There is no identity below a tuple (cells on their own are not entities; they are part of a tuple in an RDBMS). Because of this design principle an RDBMS table must have a fixed structure, and updating that structure means updating all the tuples. Columnar databases cleverly circumvent this by making the basic unit a cell. Each cell has an identity and a membership in a row of a table, which grants the freedom to have different rows with different structures.
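To make the "rows need not share a structure" point concrete, here is a toy sketch in R (this mimics the idea only; it has nothing to do with HBase's actual storage format): each row is a named list of cells grouped under a column family, and rows are free to carry different columns.

```r
# toy column-family-style rows: each row is a named list of cells,
# and rows are free to carry different columns
row1 <- list(cf1 = list(col1 = "col1"))
row2 <- list(cf1 = list(col1 = "col21", col3 = "col31"))
table.tx <- list(row1 = row1, row2 = row2)

# rows can be queried cell by cell; a missing cell is simply NULL,
# with no storage wasted on it
table.tx$row2$cf1$col3   # "col31"
table.tx$row1$cf1$col3   # NULL
```

Compare this with a data.frame, where every row is forced to carry every column, NA or not; that is precisely the sparse-data advantage listed below.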

Advantages of Columnar databases;

1. Very good for Sparse Data scenarios

2. Reduction of Tables used to describe an Entity(Merging all the RDBMS tables related to one entity into one single table is possible)

3. Stores version history by default. Built with inherent SCD (slowly changing dimension) capability.

Limitations of Hbase;

1. Information retrieval is not SQL-based; HBase uses get/put/scan style statements.

Below are some of the important commands to get started;

list –> Lists all the tables in the HBase DB

create –> Used to create tables. We need to specify at least one column family, otherwise table creation is not allowed.

e.g., create 'tx', 'cf1'

describe –> Used to display the table's schema and settings.

e.g., describe 'tx'

put –> Used to insert data into tables. Need to specify the row identifier so that the appropriate row is updated.

e.g., put 'tx','row1','cf1:col1','col1'

put 'tx','row1','cf1:col2','col2'

get –> Used for information retrieval from the table. We need to specify the row identifier when using get. We can also limit results by specifying the column family to choose.

e.g., get 'tx','row1',{COLUMN => ['cf1:col1']}

scan –> Used to display the rows of the said table.

e.g., scan 'tx'

disable –> Used to disable the table. Altering a table isn't allowed while it is enabled; it must be disabled first, then updated. A table is not available for querying while it is disabled.

e.g., disable 'tx'

enable –> Used to re-enable a disabled table, making it available for querying again.

e.g., enable 'tx'

drop –> Deletes the said table (it must be disabled first).

e.g., drop 'tx'

HBase Walkthrough;

Using the commands above, we have created a table 'tx'. Now let us do some operations on it;

Inserting/Updating rows into the table;

put 'tx','row1','cf1:col1','col1-'

put 'tx','row2','cf1:col1','col21'

In the command below, we are adding a new column to the cf1 family at run time.

put 'tx','row2','cf1:col3','col31'

Retrieving data from the table, either entirely or column-specific;

get 'tx','row2'

get 'tx','row1',{COLUMN => ['cf1:col1']}

Schema Reduction Example;

Take the classic example of an EMPLOYEE schema: an RDBMS, for ACID, storage and performance reasons, defines the three entities EMPLOYEE, TEAM and FINANCIAL as three separate tables in both OLTP and OLAP designs. HBase, on the other hand, can build a single table with employee as the single entity and the other entities as its column families, and we can add any number of columns on the fly to any of the column families. An RDBMS could also create one very large table, but it would then face the classic large-table problems: sparseness and a rigid schema.

create 'employee', {NAME=>'details',VERSIONS => 5},{NAME=>'team',VERSIONS => 6}

put 'employee','emp1','details:name','emp_1'

put 'employee','emp1','details:id',1

put 'employee','emp1','team:id',10

put 'employee','emp1','team:name','coke'

put 'employee','emp1','team:name','pepsi'

get 'employee','emp1',{COLUMN=>'team',VERSIONS=>2}

disable 'employee'

alter 'employee', 'financial'

enable 'employee'

get 'employee','emp1',{COLUMN=>'financial',VERSIONS=>15}

That is HBase at 30,000 feet…

Text Mining – Data Preprocessing using R – Part 1

For text mining projects, the most important and time-consuming phase is data pre-processing. In this phase you cleanse, validate and transform the training data for your machine learning algorithm. Improper pre-processing leads to reduced accuracy, over-fitting/under-fitting, and bias/variance problems in the algorithms. All the pre-processing done on the training data must also be done on the test data and the incoming online data, to make sure the data arrives in the same form the algorithms can consume. There are many steps in data pre-processing and not all may be needed in every project. For example, spell checking may not be needed for machine-generated data (assuming the machine makes no mistakes), part-of-speech tagging may not be needed for document genre classification, and abbreviation handling may not be needed in certain cases. Below are some of the text pre-processing steps with relevant R code.

1. Data Loading;

Data can be loaded into R from many sources. Functions for reading files, accessing Database tables are available in R Libraries.

# Loading source data; keep strings as characters so the
# sub()/gsub() pipelines below behave predictably
source_data <- read.csv("source.csv", header = FALSE,
                        stringsAsFactors = FALSE)

2. Text Normalization;

In the example of review classification, text like '3 stars', '5 stars', '2 star' can be normalized to something like 'good', 'bad', 'average' before tokenization or bag-of-words. Another example: words like 'haven't', 'shouldn't' etc. can be reduced to 'not'. This reduces the token count and the complexity of handling such tokens. Tokenizing '2.5 stars' would give you two tokens, 2.5 and stars, which the machine learning algorithm may interpret wrongly.

source_data <- sapply(source_data,
                      function(x) sub("4 stars", "good", x))
source_data <- sapply(source_data,
                      function(x) sub("1 star", "bad", x))

# replacing words with n't
# not.txt is a file with words that can be replaced with not
not <- read.csv("c:/not.txt", header = FALSE, sep = "\n")
repNot <- function(x) {
  i <- 1
  while(i <= length(not[, 1]))
  {
    # substitute each word from the lookup (the original version
    # substituted the hard-coded "havent" on every pass)
    x <- sub(not[i, 1], "not", x)
    i <- i + 1
  }
  return(x)
}
source_data <- repNot(source_data)
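The same replacement can also be done without an explicit loop, by joining the lookup words into one regex alternation. A small sketch with a made-up stand-in for the not.txt word list:

```r
# made-up stand-in for the not.txt lookup
not.words <- c("havent", "shouldnt", "wasnt")

# one alternation pattern covering every word in the lookup
pattern <- paste(not.words, collapse = "|")

texts <- c("i havent seen it", "it wasnt there at all")
gsub(pattern, "not", texts)
# "i not seen it"  "it not there at all"
```

A single vectorized gsub call replaces the whole while loop, and gsub (unlike sub) also handles multiple occurrences within one string.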

3. Abbreviations;

Sometimes handling abbreviations makes sense: certain scenarios need expansion of abbreviations and some need creation of abbreviations. Whatever the case, the snippet below can be used for both, by simply swapping the abbreviation and expansion columns.

# setting up an abbreviation lookup
# swap the columns for expansion logic
abbrev <- c("lol", "omg", "imo", "gtfo")
abbrev <- cbind(abbrev, c("laugh out loud", "oh my god",
                          "in my opinion", "get the out"))
abbrev

# function for abbreviations
findAbbrev <- function(x) {
  i <- 1
  while(i <= length(abbrev[, 1]))
  {
    x <- sub(abbrev[i, 1], abbrev[i, 2], x)
    i <- i + 1
  }
  return(x)
}

# Corpus/tm_map below come from the tm package
library(tm)

text <- "Lol, OMG, IMO this is not a drill, haven't you heard"
text <- rbind(text, "LOL, Am I wrong, GTFO")
text <- Corpus(VectorSource(text),
               readerControl = list(reader = readPlain))
# recent versions of tm require content_transformer() around
# plain string functions such as tolower
text <- tm_map(text, content_transformer(tolower))
text <- tm_map(text, removePunctuation)
text

# apply the abbreviation expansion to each document in the corpus
text <- tm_map(text, content_transformer(findAbbrev))
inspect(text)

to be continued…

Building Hellinger Distance Decision Trees

If you have stumbled on this page, I assume there is still no R package for Hellinger distance decision trees.

When I wanted to build a model using a Hellinger tree, my natural instinct was to search for an R package. When I found none, I searched for an implementation in any other language and couldn't find one either. So here is my post, which could save you some valuable time.

So, using this jar, I updated the base Weka jar and found the Hellinger tree under the Classify tab in Weka Explorer, but somehow couldn't get it to work, maybe because I was new to Weka. I didn't have the time to learn Weka, and I needed to build the model quickly, with a Hellinger distance decision tree only. So I used the jar to build the tree in our old dependable friend, Java.

Weka classifiers work best with .arff files. Data can be supplied to them in other formats, but arff files are native to Weka. We can use the RWeka package in R to export training, testing and other datasets as .arff files.

library(RWeka)
write.arff(data_train,"train.arff",eol = "\n")

Now for the sample Java code for a Hellinger-tree-based model:

import weka.classifiers.Evaluation;
import weka.classifiers.trees.HTree;
import weka.core.Instances;

import java.io.BufferedReader;
import java.io.FileReader;

public class hellingerDT {

    public static void main(String[] args) throws Exception {

        // Reading data from the training, test and classify arff files
        BufferedReader breader = new BufferedReader(new FileReader("location/train.arff"));
        Instances train = new Instances(breader);
        // Setting the first column as the class variable, indicating to
        // the model which attribute it should predict
        train.setClassIndex(0);
        breader.close();

        breader = new BufferedReader(new FileReader("location/test.arff"));
        Instances test = new Instances(breader);
        test.setClassIndex(0);
        breader.close();

        breader = new BufferedReader(new FileReader("location/classify.arff"));
        Instances classify = new Instances(breader);
        classify.setClassIndex(0);
        breader.close();

        // Instantiate a Hellinger tree model
        HTree hT = new HTree();

        // Train the model
        hT.buildClassifier(train);

        // Evaluate the model using the test data
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(hT, test);
        // Display the metrics
        System.out.println(eval.toSummaryString("Review Classification Hellinger Tree", true));
        System.out.println("Precision " + eval.precision(1) * 100 + " and Recall " + eval.recall(1) * 100);

        // Print the tree
        System.out.println(hT.graph());

        // Classify new data
        for (int i = 0; i < classify.numInstances(); i++) {
            double pred = hT.classifyInstance(classify.instance(i));
            System.out.println(pred);
        }
    }
}

MonteCarlo Simulation in R

Let us try a complex probability calculation. The problem statement: what is the probability of having at least one girl child in a family that may have 1, 2, 3 or 4 children? As you can see, this is an involved calculation for a non-mathematician, and as a simple being I would rather employ the Monte Carlo simulation technique. The idea is that if you simulate a very large number of experiments, the observed frequency of an event approaches its true probability. So let's do Monte Carlo for the above problem in R.

# simulating 10000 families
families <- rep(1, 10000)
# simulating the number of kids in each family (1 to 4, equally likely)
kids <- sapply(families, function(x) sample(1:4, 1))
# simulating the sex of each kid: 1 = female, 0 = male
sexes <- sapply(kids, function(x) sample(0:1, x, replace = TRUE))
# counting the families with at least one female kid
girls <- sum(sapply(sexes, sum) >= 1)  # I got 7631
# the probability estimate
girls / 10000  # 0.7631

That was it. The estimated probability is ~0.76 (0.7631 in my run; yours will vary slightly).
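As a sanity check, this particular problem is simple enough to solve exactly: a family with n kids has at least one girl with probability 1 - (1/2)^n, and n is 1 to 4 with equal chance, so the answer is the average of those four values:

```r
# exact probability: average of 1 - (1/2)^n over n = 1..4
exact <- mean(1 - 0.5^(1:4))
exact  # 0.765625
```

That is 0.765625, nicely close to the simulated 0.7631, which is exactly what Monte Carlo promises as the number of simulated families grows.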