Binning Measure in Test Dataset based on Training Dataset Levels

Hi,

Many of you may have converted numeric variables to categorical ones using the cut function on the training set. Once the decision tree is built, it is time to run it on the test data set, but the test set still has continuous measures. Binning the data before the train/test split is not a good option either: the bin boundaries would then be informed by the test data, which leaks information into the model.

Below is an R function I have written to bin a continuous variable based on the levels of an already binned variable (it relies on the stringr package):


library(stringr)

# levels:  the factor levels produced by cut() on the training set,
#          e.g. "(0,10]", "(10,20]"
# measure: the continuous values from the test set to be binned
bin.test.measures <- function(levels, measure){
  # parse each level string into its lower and upper bounds
  level.frame <- data.frame(level = character(0),
                            lower = numeric(0),
                            upper = numeric(0))
  for(level in levels){
    lowermatch <- str_match(level, pattern = '[\\(\\[].+,')
    lower <- as.numeric(str_sub(lowermatch, 2, str_length(lowermatch) - 1))
    uppermatch <- str_match(level, pattern = ',.+[\\)\\]]')
    upper <- as.numeric(str_sub(uppermatch, 2, str_length(uppermatch) - 1))
    level.frame <- rbind(level.frame,
                         data.frame(level = level, lower = lower, upper = upper))
  }
  # assign each value to the matching interval
  binned <- character(0)
  for(number in measure){
    for(i in 1:nrow(level.frame)){
      if(i == 1){
        # first bin is closed on both ends so boundary values are not dropped
        if(number >= level.frame[i, 2] && number <= level.frame[i, 3]){
          binned <- c(binned, as.character(level.frame[i, 1]))
        }
      } else if(i < nrow(level.frame)){
        if(number > level.frame[i, 2] && number <= level.frame[i, 3]){
          binned <- c(binned, as.character(level.frame[i, 1]))
        }
      } else {
        # last bin is open-ended above, so larger values still get a level
        if(number > level.frame[i, 2]){
          binned <- c(binned, as.character(level.frame[i, 1]))
        }
      }
    }
  }
  binned
}
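If you still have access to the numeric breaks used on the training set, a simpler alternative (a minimal sketch with made-up vectors) is to reuse them directly with cut, which guarantees identical levels on both sets:

```r
# hypothetical training data; breaks are chosen on the training data only
train <- c(1, 5, 12, 18, 25)
breaks <- c(0, 10, 20, 30)
train.binned <- cut(train, breaks)

# reuse the SAME breaks on the test data, so no information leaks
test <- c(3, 11, 29)
test.binned <- cut(test, breaks)
levels(test.binned)   # identical to levels(train.binned)
```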

Hope this helps!!!

Time Savers

This article is a list of time-saving code snippets:

1. Listing all files in a folder using R:


file.list <- list.files('C:/Users/MyName/Desktop/')
write.table(file.list,'C:/Users/MyName/Desktop/files.txt')

2. Creating a bunch of variables without any manual intervention:

mylist <- 1:10
for(i in mylist){
  assign(paste('movie', i, sep = ''), i + 10)
}
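Variables created this way can be read back programmatically with get (or mget); a small sketch:

```r
# create movie1 .. movie10 holding the values 11 .. 20
for(i in 1:10){
  assign(paste('movie', i, sep = ''), i + 10)
}

# read one back by name
get('movie3')    # 13

# or fetch all of them at once as a named list
all.movies <- mget(paste('movie', 1:10, sep = ''))
```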

I will keep adding more.

Unzipping a humungous number of files using R

Hi,

Assume you have this situation;

(screenshot: a directory full of zip files)

There are 400 such zip files and you want to unzip all of them. Imagine you are in an R session and do not want to leave R just for this. How do we achieve it? Can we do it quickly and efficiently? Yes, we can.

A combo of the foreach, doSNOW/doMC and utils libraries will do the trick.

rm(list=ls())

# setting up working directory
setwd('/root/Desktop/multi-modal-train')

# reading the list of all zip files (the pattern keeps only .zip entries)
zip.files <- list.files('/root/Desktop/multi-modal-train', pattern = '\\.zip$')
zip.files

# loading required libraries
library(foreach)
# Windows fork
# library(doSNOW)
# c1 <- makeCluster(1)
# registerDoSNOW(c1)

# linux fork
library(doMC)
registerDoMC()

# setting operation variables
base.dir <- '/root/Desktop/multi-modal-train'
i <- 1

# parallel loops for faster processing
# unzipping files can proceed in parallel
# note: the original 1:(length(zip.files)-1) index silently skipped the last
# file; iterating over the .zip names directly avoids off-by-one errors
foreach(f = grep('\\.zip$', zip.files, value = TRUE)) %dopar% {
  unzip(paste(base.dir, f, sep = '/'), exdir = 'unzipped')
  cat(paste('unzipping', f, sep = '-'), '\n')
  gc(reset = TRUE)
}

# Windows fork
# stopCluster(c1)

The output is a fresh 'unzipped' directory containing the extracted files (screenshot omitted).
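On Linux/macOS the same job can also be done with the base parallel package alone, without foreach (a sketch, assuming the same hypothetical base.dir layout as above):

```r
library(parallel)

# same hypothetical directory as in the foreach version
base.dir <- '/root/Desktop/multi-modal-train'
zip.files <- list.files(base.dir, pattern = '\\.zip$')

# mclapply forks one worker per file, up to mc.cores workers
# (fork-based, so Linux/macOS only; on Windows use parLapply instead)
results <- mclapply(zip.files, function(f) {
  unzip(file.path(base.dir, f), exdir = 'unzipped')
}, mc.cores = 2)
```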

Happy R’ing

Reading humungous csv files in R

Hi,

The other day, for a Kaggle competition, I had to read a csv file into R. It was around 4 GB in size. None of my text editors could open the file, and even the Access text import failed. My options seemed to be writing a macro or looking at an open-source RDBMS like MySQL. Luckily, my teammate gave me the idea of using file stream readers and writers in Java, and my manager showed me this link.

I didn’t need to read the entire file; for the initial experiments I needed only a part of it. So below is a code snippet to extract part of a csv:

# initializing file connections
x <- file('extra_unsupervised_data.csv', 'rt')
y <- file('unsupervised_trim.csv', 'wt')

# reading the first 10001 lines (header plus 10000 rows)
line <- readLines(x, n = 10001)

# writing the lines to the open output connection
# (cat with file='...' would ignore the connection y opened above)
writeLines(line, y)

# closing connections
close(x)
close(y)
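The same connection trick extends to processing the whole file in chunks, so only a slice is ever in memory. A self-contained sketch (it writes its own small throw-away csv; the counting step stands in for real processing):

```r
# create a small throw-away csv so the sketch is self-contained
writeLines(c('id,value', paste(1:25, 1:25, sep = ',')), 'big.csv')

con <- file('big.csv', 'rt')
header <- readLines(con, n = 1)     # keep the header out of the chunks
rows.seen <- 0
repeat {
  chunk <- readLines(con, n = 10)   # 10 lines at a time
  if(length(chunk) == 0) break
  rows.seen <- rows.seen + length(chunk)   # replace with real processing
}
close(con)
rows.seen   # 25
```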

May this snippet save you some time!!!!!

R SMOTE Function – Reminder

SMOTE is a wonderful function in R, from the DMwR package, and it does something very useful: it helps rebalance under-represented classes. If your data is imbalanced, SMOTE generates synthetic examples of the minority class by interpolating between each minority example and its nearest neighbours (the number of neighbours, k, is passed as an argument). There’s one catch though: it cannot work with multi-class data. It works only with two classes, so you will need some special data transformations to SMOTE a multi-class set.

Try testing your multi-class data with the below syntax:

library(DMwR)
train_cl1s <- SMOTE(V1 ~ ., train_cl1, perc.over = 1500, perc.under = 0, k = 11)

Now check the resultant data frame. It will consist of only one class and nothing more, because SMOTE recognizes only two classes.
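One common workaround is one-vs-rest relabeling: collapse every class except the one you want to oversample into a single 'other' level, SMOTE the resulting two-class frame, and repeat per class. A sketch using only base R and a made-up frame (the SMOTE call itself is left commented out):

```r
# hypothetical multi-class training frame
train <- data.frame(V1 = factor(c('a', 'b', 'c', 'a', 'b', 'c')),
                    x  = 1:6)

# collapse everything that is not class 'a' into 'other'
train.bin <- train
train.bin$V1 <- factor(ifelse(train.bin$V1 == 'a', 'a', 'other'))
levels(train.bin$V1)   # "a" "other"

# now a standard two-class SMOTE call applies, e.g.:
# library(DMwR)
# train.a <- SMOTE(V1 ~ ., train.bin, perc.over = 300, perc.under = 0, k = 1)
```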

Repeating Numbers in R without Loop structures

Hello,

The other day, I needed to repeat values based on a sequence of other values; e.g., the array a, b, c needed to be repeated according to another array 3, 2, 1, and the output needed to be a, a, a, b, b, c.

I started my R code with for… then suddenly a booming voice echoed: “If you are using for in R, you aren’t using it properly.” So I started thinking about how to do this without a loop.

Firstly, the code to repeat values based on a sequence of numbers using a loop:

base <- c("a", "b", "c")
repetition <- c(3, 2, 1)
new <- character(0)
for(i in 1:length(repetition)){
  new <- c(new, rep(base[i], repetition[i]))
}
new

Finally, the code to repeat values without a loop. The code is much cleaner, shorter and sweeter:

base <- c("a", "b", "c")
repetition <- c(3, 2, 1)
# rep is vectorized over 'times', so no loop (and no data frame detour) is
# needed; note that going through cbind/data.frame and as.numeric on a factor
# only works by accident, since as.numeric on a factor returns level codes
rep_final <- rep(base, times = repetition)
rep_final   # "a" "a" "a" "b" "b" "c"

You can also use the na.locf function from the zoo library for such cases. Say NO to loops!

Text Mining – Data Preprocessing using R – Part 1

For text mining projects, the most important and time-consuming phase is data pre-processing. In this phase, you cleanse, validate and transform the training data for your machine learning algorithm. Improper pre-processing leads to reduced accuracy, over-/under-fitting, and bias/variance problems in the algorithms. All pre-processing applied to the training data must also be applied to the test data and to incoming online data, to make sure the data is in the same form the algorithms can consume. There are many steps in data pre-processing, and not all may be needed in every project. For example, spell checking may not be needed for machine-generated data (assuming the machine makes no typos), part-of-speech tagging may not be needed for document genre classification, and abbreviation handling may be unnecessary in certain cases. Below are some of the text pre-processing steps with relevant R code.

1. Data Loading:

Data can be loaded into R from many sources. Functions for reading files, accessing Database tables are available in R Libraries.

# loading source data (keep text as character, not factors)
source_data <- read.csv("source.csv", header = FALSE, stringsAsFactors = FALSE)

2. Text Normalization:

In the example of review classification, text like '3 stars', '5 stars', '2 star' can be normalized to something like 'good', 'bad', 'average' before tokenization or bag-of-words. Another example: words like 'haven't', 'shouldn't', etc. can be reduced to 'not'. This reduces the token count and the complexity of handling such tokens. Tokenizing '2.5 stars' would give you two tokens, '2.5' and 'stars', which the machine learning algorithm may interpret wrongly.

source_data <- sapply(source_data,
                      function(x) sub("4 stars", "good", x))
source_data <- sapply(source_data,
                      function(x) sub("1 star", "bad", x))

# replacing words ending in n't
# not.txt is a file with one word per line that can be replaced with "not"
not <- read.csv("c:/not.txt", header = FALSE, sep = "\n")
repNot <- function(x) {
  for(i in 1:nrow(not)){
    # use the i-th word from the lookup, not a hard-coded "havent"
    x <- sub(not[i, 1], "not", x)
  }
  return(x)
}
source_data <- repNot(source_data)
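The per-word loop can also be collapsed into a single gsub call with an alternation pattern (a sketch with a made-up word list, so no not.txt file is needed):

```r
# hypothetical lookup of words to collapse into "not"
not.words <- c("havent", "shouldnt", "wouldnt")

# build one regex: "havent|shouldnt|wouldnt"
pattern <- paste(not.words, collapse = "|")

text <- c("i havent seen it", "you shouldnt go")
text <- gsub(pattern, "not", text)
text   # "i not seen it"  "you not go"
```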

3. Abbreviations:

Sometimes handling abbreviations makes sense: certain scenarios need expansion of abbreviations, and some need creation of abbreviations. Whatever the case, the below snippet can be used either way by just swapping the abbreviation and expansion columns.

# setting up an abbreviation lookup
# swap the columns for the expansion-to-abbreviation logic
abbrev <- c("lol", "omg", "imo", "gtfo")
abbrev <- cbind(abbrev, c("laugh out loud", "oh my god",
                          "in my opinion", "get the out"))
abbrev

# function for abbreviations
findAbbrev <- function(x) {
  for(i in 1:nrow(abbrev)){
    x <- sub(abbrev[i, 1], abbrev[i, 2], x)
  }
  return(x)
}

library(tm)
text <- "Lol, OMG, IMO this is not a drill, haven't you heard"
text <- rbind(text, "LOL, Am I wrong, GTFO")
text <- Corpus(VectorSource(text),
               readerControl = list(reader = readPlain))
# lower-case and strip punctuation first so the lookup matches
text <- tm_map(text, tolower)
text <- tm_map(text, removePunctuation)
text

# text <- sapply(text, function(x) findAbbrev(x))
text <- findAbbrev(text)
text

To be continued…

MonteCarlo Simulation in R

Let us try a complex probability calculation. The problem statement: what is the probability of having at least one girl child in a family which may have 1, 2, 3 or 4 children? This, as you see, is an involved calculation for a non-mathematician, and as a simple being I would rather employ the Monte Carlo simulation technique. What this technique states is that if you simulate a very large number of experiments, the observed frequency of an event approaches its true probability. So let’s do Monte Carlo for the above problem in R.

# simulating 10000 families
families <- rep(1, 10000)
# simulating the number of kids in each family (1 to 4, equally likely)
n.kids <- sapply(families, function(x) sample(1:4, 1))
# simulating male and female kids in each family: 1 = female, 0 = male
kids <- sapply(n.kids, function(x) sample(0:1, x, replace = TRUE))
# counting families with at least one female kid
# (which() returns indices, so we take the count with sum() instead)
girls <- sum(sapply(kids, sum) >= 1)
girls           # I got 7631
# probability
girls / 10000   # 0.7631

That was it. The probability is ~0.7631.
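As a sanity check (assuming family sizes 1 to 4 are equally likely and each child is a girl with probability 1/2), the exact answer can be computed directly: the chance of at least one girl in a family of n children is 1 - (1/2)^n, averaged over n = 1..4:

```r
# P(at least one girl | n kids) = 1 - 0.5^n, averaged over n = 1:4
p.exact <- mean(1 - 0.5^(1:4))
p.exact   # 0.765625, close to the simulated 0.7631
```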

Looping in R (without For or While statements)

Salutations Reader,

The other day I came across a guy who made a statement along the lines of: “If you are using for in R, you aren’t using it properly.” At first I was surprised. How do I write loops without a for? How can I do Monte Carlo simulations without a for statement? Then I got to thinking about what he meant. R is good at vector manipulation; can I use this to avoid loops? Let me show this with a Monte Carlo simulation to calculate the probability of a head in a coin toss (assumption: the coin is unbiased and there is justice in the universe).

Flipping 100 coins:

a <- rep(1, 100)
a <- sapply(a, function(x) sample(0:1, x))
probability <- sum(a) / 100   # 0.51 in one run

Flipping 100000 coins:

b <- rep(1, 100000)
b <- sapply(b, function(x) sample(0:1, x))
probability <- sum(b) / 100000   # 0.499 in one run
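For the record, the sapply call still loops under the hood, once per coin; sample can draw all the flips in one truly vectorized call (a sketch; the seed is fixed only to make the run reproducible):

```r
set.seed(42)                                  # reproducibility only
flips <- sample(0:1, 100000, replace = TRUE)  # all 100000 flips at once
probability <- mean(flips)
probability                                   # ~0.5
```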

Now I understand why the guy said what he said. Go forth and simulate!!!!