Binning Measure in Test Dataset based on Training Dataset Levels

Hi,

Many of you will have converted numeric variables to categorical ones with the cut function in the training set. Once the decision tree is built, it is time to run it on the test data set, but the test data set still holds the continuous measures. Binning the whole data set before splitting is not a good way out: the bin boundaries would then be derived partly from the test data, so the model has effectively seen the data it is supposed to be evaluated on.

Below is an R function I have written to bin a continuous variable using the levels of an already binned variable:


library(stringr)  # str_match, str_sub, str_length

# levels:  level labels produced by cut() on the training set, e.g. "(0,10]"
# measure: numeric vector from the test set to be binned into those levels
bin.test.measures <- function(levels, measure){
  # parse the lower and upper bound out of every level label
  level.frame <- data.frame(level = character(0), lower = numeric(0), upper = numeric(0),
                            stringsAsFactors = FALSE)
  for(level in levels){
    lowermatch <- str_match(level, pattern = '[\\(\\[].+,')
    lower <- as.numeric(str_sub(lowermatch, 2, str_length(lowermatch)-1))
    uppermatch <- str_match(level, pattern = ',.+[\\)\\]]')
    upper <- as.numeric(str_sub(uppermatch, 2, str_length(uppermatch)-1))
    temp <- data.frame(level = level, lower = lower, upper = upper, stringsAsFactors = FALSE)
    level.frame <- rbind(level.frame, temp)
  }
  # one slot per test value, so the result always lines up with measure;
  # values below the first bin (and missing values) stay NA
  binned <- rep(NA_character_, length(measure))
  for(k in seq_along(measure)){
    number <- measure[k]
    if(is.na(number)) next
    for(i in 1:nrow(level.frame)){
      if(i==1){
        # first bin includes its lower bound
        hit <- number>=level.frame[i,2] && number<=level.frame[i,3]
      } else if(i < nrow(level.frame)) {
        hit <- number>level.frame[i,2] && number<=level.frame[i,3]
      } else {
        # last bin is open-ended, so values above the training range still get a level
        hit <- number>level.frame[i,2]
      }
      if(hit) binned[k] <- as.character(level.frame[i,1])
    }
  }
  binned
}

Hope this helps!!!

Building Data Visualizations with SVG+HTML+JavaScript

Hi,

If your day job is to build reports, I am pretty sure you have at least once received comments like “Oh, can we move the tooltip inside the bar in the bar chart?” or “Can you also add a trend line to a stacked bar chart?”, and then rushed back to Excel, Tableau, BO, Cognos etc. to check whether the chart’s options allow it. Recently, while learning JavaScript, I came across an awesome concept called Scalable Vector Graphics (SVG). I found that JavaScript + HTML + CSS, already an awesome combo, works very well with SVG. So I set about a long-standing item on my to-do list: building a chart from scratch that can be tweaked easily and plugged onto data. So, without further ado…

[Image: svgChart]

Now I can play around with anything here, so let’s go crazy.

[Image: crazyChart]

Now, for the Code….

<!DOCTYPE html>
<html>
<head>
  <meta charset='utf-8'>
  <title></title>
  <style>

  </style>
</head>
<body>
  <script type='text/javascript'>
    //console.log('Initializing....');
    var barWidth = 30, canvasHeight = 300, canvasWidth = 800, barSpacing=20,toolTipPos = -25;
    var data = [['India',200],['USA',60],['Africa',120],['Europe',95],['Australia',45],['Asia',20]];
    var svgBase = document.createElementNS('http://www.w3.org/2000/svg','svg');
    svgBase.setAttribute('height',canvasHeight);
    svgBase.setAttribute('width',canvasWidth);
    document.body.appendChild(svgBase);
    //Function for Rendering Line
    var drawLine = function(x1,y1,x2,y2){
      var tempLine = document.createElementNS('http://www.w3.org/2000/svg','line');
      tempLine.setAttribute('x1',x1);
      tempLine.setAttribute('y1',y1);
      tempLine.setAttribute('x2',x2);
      tempLine.setAttribute('y2',y2);
      tempLine.setAttribute('style','stroke:rgb(0,255,0);stroke-width:1');
      svgBase.appendChild(tempLine);
    };
    //Rendering Y axis
    drawLine(barWidth,10,barWidth,canvasHeight);
    //Rendering X axis
    drawLine(0,(canvasHeight-40),barWidth*data.length*2.2,(canvasHeight-40));
    for(var i=0; i<data.length; i++){
      //Rendering Bars
      var tempRect = document.createElementNS('http://www.w3.org/2000/svg','rect');
      tempRect.setAttribute('x',(barWidth+barSpacing)*(i+1));
      tempRect.setAttribute('y',(canvasHeight-50)-data[i][1]);
      tempRect.setAttribute('height',data[i][1]);
      tempRect.setAttribute('width',barWidth);
      tempRect.setAttribute('id','Rect'+i);
      tempRect.setAttribute('fill','blue');
      svgBase.appendChild(tempRect);
      //Rendering Tooltips
      var toolTip = document.createElementNS('http://www.w3.org/2000/svg','text');
      toolTip.setAttribute('x',(barWidth+barSpacing)*(i+1)+4.5);
      toolTip.setAttribute('y',(canvasHeight-45)-data[i][1]+toolTipPos);
      toolTip.setAttribute('fill','black');
      toolTip.textContent = data[i][1];
      svgBase.appendChild(toolTip);
      //Rendering X-Axis Labels
      var seriesName = document.createElementNS('http://www.w3.org/2000/svg','text');
      seriesName.setAttribute('x',(barWidth+barSpacing)*(i+1)-2);
      seriesName.setAttribute('y',(canvasHeight-20));
      seriesName.setAttribute('fill','red');
      seriesName.setAttribute('id','xlabel'+i);
      seriesName.textContent = data[i][0];
      svgBase.appendChild(seriesName);
      //console.log(i+' th element done!')
    }

    /*var toolTip = document.createElementNS('http://www.w3.org/2000/svg','text');
    toolTip.setAttribute('x',30);
    toolTip.setAttribute('y',50);
    toolTip.setAttribute('fill','black');
    toolTip.setAttribute('transform','rotate(30 50, 45)');
    toolTip.textContent = 'Junk';
    svgBase.appendChild(toolTip);*/
  </script>
</body>
</html>

I haven’t spent much time on making this code leaner, but it is possible to define functions and objects and then
be able to switch between different charts on the fly.
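As a rough illustration of the "functions and objects" idea, the sketch below separates the bar geometry from the drawing. The names barLayout and renderBars and the fields of opts are made up for this example and are not part of the chart above; only the arithmetic is taken from it. The drawing part is guarded so that it runs only in a browser that already has an svg element on the page.

```javascript
// Hypothetical helper: compute bar geometry as plain objects (no DOM needed).
function barLayout(data, opts) {
  var step = opts.barWidth + opts.barSpacing;
  return data.map(function (row, i) {
    return {
      label: row[0],
      value: row[1],
      x: step * (i + 1),
      y: (opts.canvasHeight - 50) - row[1],
      width: opts.barWidth,
      height: row[1]
    };
  });
}

// Hypothetical helper: draw the computed bars into an existing <svg> element.
function renderBars(svg, bars, fill) {
  var ns = 'http://www.w3.org/2000/svg';
  bars.forEach(function (bar, i) {
    var rect = document.createElementNS(ns, 'rect');
    ['x', 'y', 'width', 'height'].forEach(function (name) {
      rect.setAttribute(name, bar[name]);
    });
    rect.setAttribute('id', 'Rect' + i);
    rect.setAttribute('fill', fill);
    svg.appendChild(rect);
  });
}

var opts = { barWidth: 30, barSpacing: 20, canvasHeight: 300 };
var bars = barLayout([['India', 200], ['USA', 60]], opts);
console.log(bars);

// Only draw when running in a browser with an <svg> on the page.
if (typeof document !== 'undefined') {
  var svg = document.querySelector('svg');
  if (svg) { renderBars(svg, bars, 'blue'); }
}
```

With the geometry in one place, another chart type would only need its own layout function, and the same data array could be passed to either.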

HTML Manipulation using LeapMotion

Hi,

Recently I got hold of an excellent piece of equipment called the Leap Motion, and believe me, it is awesome.

The tutorial below shows how to manipulate HTML using the Leap Motion JavaScript library. It assumes the reader understands basic programming concepts such as loops, conditions, functions, objects, methods and properties, as well as HTML.

The purpose of the code below is to move an image around the web page using input from a Leap Motion.


<!DOCTYPE html>
<html>
<head>
  <title>Leap Graph Explorer</title>
  <script src="leap.js"></script>
</head>
<body>
  <script>
    var controllerOptions = {enableGestures: true};
    Leap.loop(controllerOptions, function(frame){
      var img = document.getElementById("pic");
      var wid = img.width;
      var ht = img.height;
      if(frame.gestures.length>=1){
        document.getElementById("test").innerHTML = frame.gestures[0].type;
        if(frame.gestures[0].type=="keyTap"){
          img.style.width=(wid*1.1)+"px";
          img.style.height=(ht*1.1)+"px";
        }
        if(frame.gestures[0].type=="swipe"){
          img.src="./screenshot1.png";
        }
        if(frame.gestures[0].type=="circle"){
          img.style.width=892+"px";
          img.style.height=580+"px";
        }
      }
      if(frame.hands.length > 0){
        var hand = frame.hands[0];
        //vectorToString(hand.palmPosition);
        //document.getElementById("test").innerHTML = hand.sphereRadius;
        document.getElementById("pic").style.left = hand.palmPosition[0]-100+'px';
        document.getElementById("pic").style.top = hand.palmPosition[1]-50+'px';
      }
    })
  </script>
  <div id="test">start</div>
  <img id="pic" src="./screenshot12.png" style="position: absolute; top: 20px; left: 15px"/>
</body>
</html>

As you can see, the most important part of our code is the JavaScript. Let us have a closer look, shall we:


var controllerOptions = {enableGestures: true};
Leap.loop(controllerOptions, function(frame){
  var img = document.getElementById("pic");
  var wid = img.width;
  var ht = img.height;
  if(frame.gestures.length>=1){
    document.getElementById("test").innerHTML = frame.gestures[0].type;
    if(frame.gestures[0].type=="keyTap"){
      img.style.width=(wid*1.1)+"px";
      img.style.height=(ht*1.1)+"px";
    }
    if(frame.gestures[0].type=="swipe"){
      img.src="./screenshot1.png";
    }
    if(frame.gestures[0].type=="circle"){
      img.style.width=892+"px";
      img.style.height=580+"px";
    }
  }
  if(frame.hands.length > 0){
    var hand = frame.hands[0];
    //vectorToString(hand.palmPosition);
    //document.getElementById("test").innerHTML = hand.sphereRadius;
    document.getElementById("pic").style.left = hand.palmPosition[0]-100+'px';
    document.getElementById("pic").style.top = hand.palmPosition[1]-50+'px';
  }
})

To keep the post concise, I will quickly go through what is being done in the code.

  • Like any other controller device, Leap Motion runs an endless loop that polls for input from the user; this is what the Leap.loop method does. The function passed to it is called for every frame.
  • We enable the default gestures such as swipe, circle and key tap by setting enableGestures to true in the controllerOptions variable.
  • We get hold of the image and its attributes through document.getElementById.
  • To show which gesture is being made, we write its type into an HTML div called “test”.
  • Each “keyTap” gesture scales the image’s height and width by a factor of 1.1, i.e. a 10% increase per tap. A “circle” gesture sets the size back to 892 by 580 pixels, the original size, and a “swipe” gesture swaps the image source.
  • Last in line is the code that moves the image by setting its left and top offsets from the palm position returned by the Leap Motion hand object.
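The gesture handling described in the list above can be pulled into a small pure function, which makes it possible to try the logic without a device attached. This is only a sketch: nextSize and ORIGINAL are names invented here, and 892 by 580 is assumed to be the original image size because those are the values hard-coded in the circle branch of the code above.

```javascript
// Assumed original size, taken from the hard-coded values in the circle branch.
var ORIGINAL = { width: 892, height: 580 };

// Hypothetical helper: given a gesture type and the current size,
// return the size the image should have next.
function nextSize(gestureType, size) {
  if (gestureType === 'keyTap') {
    // grow by a factor of 1.1 (10%) per tap
    return { width: size.width * 1.1, height: size.height * 1.1 };
  }
  if (gestureType === 'circle') {
    // reset to the assumed original size
    return { width: ORIGINAL.width, height: ORIGINAL.height };
  }
  // any other gesture leaves the size untouched
  return { width: size.width, height: size.height };
}

var afterTap = nextSize('keyTap', { width: 100, height: 50 });
var afterCircle = nextSize('circle', afterTap);
console.log(afterTap, afterCircle);

// Wire it up only when leap.js and a page are actually available.
if (typeof Leap !== 'undefined' && typeof document !== 'undefined') {
  Leap.loop({ enableGestures: true }, function (frame) {
    if (frame.gestures.length === 0) { return; }
    var img = document.getElementById('pic');
    var size = nextSize(frame.gestures[0].type, { width: img.width, height: img.height });
    img.style.width = size.width + 'px';
    img.style.height = size.height + 'px';
  });
}
```

Keeping the size rules apart from the Leap.loop callback also makes it easier to add or change a gesture later without touching the DOM code.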

This is how we can achieve simple HTML manipulation using Leap Motion and JavaScript.

Time Savers

This article is a list of time-saving code snippets:

1. List all files in a folder using R:


file.list <- list.files('C:/Users/MyName/Desktop/')
write.table(file.list,'C:/Users/MyName/Desktop/files.txt')

2. Create a bunch of variables without any manual intervention:

mylist <- c(1:10)
for(i in mylist){
  # creates the variables movie1 ... movie10 holding the values 11 ... 20
  assign(paste('movie',i,sep=''),i+10)
}

I will keep adding more.

Unzipping a humungous number of files using R

Hi,

Assume you have this situation:

[Image: zip files]

There are 400 such zip files and you want to unzip them. Imagine you are in an R session and do not want to exit R just for this. How do we achieve this? Can we do this quickly and efficiently? Yes, we can.

A combination of the foreach, doSNOW/doMC and utils libraries will do the trick.

rm(list=ls())

# setting up working directory
base.dir <- '/root/Desktop/multi-modal-train'
setwd(base.dir)

# reading list of all zip files
# (the pattern keeps the 'unzipped' output folder and any other files out of
# the list; it assumes every archive name ends in .zip)
zip.files <- list.files(base.dir, pattern = '\\.zip$')
zip.files

# loading required libraries
library(foreach)
# Windows fork
# library(doSNOW)
# c1 <- makeCluster(1)   # raise the number of workers to actually run in parallel
# registerDoSNOW(c1)

# linux fork
library(doMC)
registerDoMC()

# parallel loops for faster processing
# unzipping files can proceed in parallel
foreach(i = seq_along(zip.files)) %dopar% {
  unzip(paste(base.dir,zip.files[i],sep='/'),exdir = 'unzipped')
  cat(paste('unzipping',zip.files[i],sep = '-'), '\n')
  gc(reset=TRUE)
}

# Windows fork
# stopCluster(c1)

The output is below:
[Image: final]

Happy R’ing

How to convert pixelmap to a Picture – JAVA

Hi,

As part of a Kaggle challenge, I had to visualize training data that is just a bunch of numbers giving pixel intensities. To convert this into a picture file such as JPEG or BMP I needed a converter. I found some Java code online, but it didn’t solve my problem, so I had to tweak it a bit to make it work for my example.

The input to the program is a pixel map like this:

[Image: Pixel Map]

To convert this into a picture, you can use the code below.

Note: my pixel map has 2304 pixel intensities, which translates to a 48*48 pixel image, so the code below is hard-coded for that size. For other dimensions, update the 48s accordingly. The intensities are passed in as command-line arguments, one per pixel.

import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;
import java.io.File;
import javax.imageio.ImageIO;

public class pixtoImage {

    /**
     * @param args the 48*48 pixel intensities, one command-line argument per pixel
     */
    public static void main(String[] args) {
        File imageFile = new File("D:/Pics/0-6.png");
        // 3 bands in TYPE_INT_RGB
        int NUM_BANDS = 3;
        if (args.length < 48 * 48) {
            System.err.println("Expected " + (48 * 48) + " pixel values, got " + args.length);
            return;
        }
        int[] pixelMap = new int[48 * 48 * NUM_BANDS];
        for (int i = 0; i < 48; i++) {
            for (int j = 0; j < 48; j++) {
                // copy the same intensity into all three bands to get a grey pixel
                int intensity = Integer.parseInt(args[(i * 48) + j]);
                for (int band = 0; band < NUM_BANDS; band++) {
                    pixelMap[((i * 48) + j) * NUM_BANDS + band] = intensity;
                }
            }
        }
        BufferedImage picImage = getImageFromArray(pixelMap, 48, 48);
        try {
            ImageIO.write(picImage, "png", imageFile);
            System.out.println("Written");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static BufferedImage getImageFromArray(int[] pixels, int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        // write straight into the image's own raster
        WritableRaster raster = image.getRaster();
        raster.setPixels(0, 0, width, height, pixels);
        return image;
    }
}

Output for the pixel map shown in the image above is:

[Image: 0-6]

May this snippet save you some time. Have fun…

Reading humungous csv files in R

Hi,

The other day, for Kaggle, I had to read a CSV file into R. It was around 4GB in size. None of the text editors could open the file, and even the Access text import failed. I was left with writing a macro or turning to an open source RDBMS like MySQL. Luckily, my teammate gave me the idea of using file stream readers and writers, as one would in Java. Also my manager showed me this link.

I didn’t need to read the entire file; for initial experiments I needed only a part of it. So below is a code snippet that copies the first part of the CSV into a smaller file:

# initializing file readers
x<-file('extra_unsupervised_data.csv','rt')
x
y<-file('unsupervised_trim.csv','wt')
y

# reading the first 10001 lines
line<-readLines(x,n=10001)

# writing lines through the open connection, one line per row
writeLines(line,y)

# closing files
close(x)
close(y)

May this snippet save you some time!!!!!

R SMOTE Function – Reminder

SMOTE is a wonderful function in R. It belongs to the package DMwR, and it does something quite distinctive: it helps with imbalanced classes. If your data is imbalanced, SMOTE generates more examples of the under-represented class, synthesizing them from the nearest neighbours of the existing cases, with the number of neighbours given by the k argument. There’s one catch, though. It cannot work with multi-class data; it works only with two classes. You will need to do some special data transformations before you can SMOTE such data.

Try testing your multi class data with the below syntax;

train_cl1s<-SMOTE(V1 ~ ., train_cl1, perc.over = 1500, perc.under = 0, k=11)

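Now check the resulting data frame. It will consist of only one class, as SMOTE recognizes only two classes. As far as I can tell, perc.under = 0 in this call also means that no rows are sampled from the other classes, so only the rarest class and its synthetic examples remain.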

Repeating Numbers in R without Loop structures

Hello,

The other day, I needed to repeat values based on a sequence of other values; e.g., an array a,b,c needed to be repeated based on another array 3,2,1, and the output needed to be a,a,a,b,b,c.

I started my R code with “for”… then suddenly a booming voice echoed “If you are using “for” in R you aren’t using it properly”. So I started thinking about how to do this without a loop.

First, the code to repeat values based on a sequence of numbers using a loop:

base<-c("a","b","c")
repetition<-c(3,2,1)
new<-c()
for(i in seq_along(repetition)){
  # append base[i] as many times as repetition[i] asks for
  new<-c(new,rep(base[i],repetition[i]))
}
new

Finally, the code to repeat values without a loop. The code is much cleaner, shorter and sweeter:

base<-c("a","b","c")
repetition<-c(3,2,1)
# rep() is vectorised: each element of base is repeated by the matching count
rep_final<-rep(base,times=repetition)
rep_final

You can also use the na.locf function from the zoo library for such cases. Say NO to loops!!!!!!!

HBase 101 and Tutorial

HBase is a component of the Hadoop ecosystem. It is a column-oriented (NoSQL) database that uses in-memory processing to add quick read and write capability on top of the write-once, read-many-times rigidity of Hadoop. Like any other columnar database, HBase uses a row identifier and column families.

The most basic building block of an RDBMS is the tuple; an RDBMS table is a collection of tuples. There is no identity below a tuple (cells cannot be entities on their own, they are always part of a tuple). Because of this design principle an RDBMS table has to have a fixed structure, and changing it means updating all the tuples. Columnar databases cleverly circumvent this by making the cell the basic building block. Each cell has its own identity and belongs to a row in a table, which gives different rows the freedom to have different structures.

Advantages of columnar databases:

1. Very good for sparse data scenarios.

2. Fewer tables are needed to describe an entity (all the RDBMS tables related to one entity can be merged into a single table).

3. Version history is stored by default, so SCD capability is built in.

Limitations of HBase:

1. Data retrieval is not SQL-oriented; it uses get/put/scan style statements.

Below are some of the important commands to get started:

list -> Lists all the tables in the HBase DB.

create -> Used to create tables. At least one column family must be specified, otherwise table creation is not allowed.

e.g., create 'tx', 'cf1'

describe -> Used to display the table's definition, i.e. its column families and their settings.

e.g., describe 'tx'

put -> Used to insert data into tables. The row identifier must be specified so that the appropriate row is updated.

e.g., put 'tx','row1','cf1:col1','col1'

put 'tx','row1','cf1:col2','col2'

get -> Used to retrieve a row from the table. The row identifier must be specified when using get. Results can also be limited by specifying the columns or column families to return.

e.g., get 'tx','row1',{COLUMN => ['cf1:col1']}

scan -> Displays the rows stored in the table.

e.g., scan 'tx'

disable -> Used to disable the table. Altering a table isn't allowed while it is enabled, so it has to be disabled first and then altered. A table is not available for queries while it is disabled.

e.g., disable 'tx'

enable -> Used to enable a disabled table so that it is available for querying again.

e.g., enable 'tx'

drop -> Deletes the table. The table has to be disabled first.

e.g., drop 'tx'

HBase walkthrough:

Using the commands above, we have created a table 'tx'. Now let us do some operations on it.

Inserting/updating rows in the table:

put 'tx','row1','cf1:col1','col1-'

put 'tx','row2','cf1:col1','col21'

In the command below we add a new column to the cf1 family at run time:

put 'tx','row2','cf1:col3','col31'

Retrieving a whole row, or only specific columns, from the table:

get 'tx','row2'

get 'tx','row1',{COLUMN => ['cf1:col1']}

Schema reduction example:

Using the classic example of an EMPLOYEE table, we can explore how HBase reduces a schema. In OLTP systems and RDBMS-based OLAP systems, EMPLOYEE, TEAM and FINANCIAL would be three separate tables, one per entity, for ACID, storage and performance reasons. HBase, on the other hand, can build a single table with employee as the only entity and the other entities as its properties (column families). We can add any number of columns on the fly to any of the column families. An RDBMS could also create one very large table, but it would then suffer from the usual large-table problems: sparseness and a rigid schema.

create 'employee', {NAME=>'details',VERSIONS => 5},{NAME=>'team',VERSIONS => 6}

put 'employee','emp1','details:name','emp_1'

put 'employee','emp1','details:id',1

put 'employee','emp1','team:id',10

put 'employee','emp1','team:name','coke'

put 'employee','emp1','team:name','pepsi'

get 'employee','emp1',{COLUMN=>'team',VERSIONS=>2}

disable 'employee'

alter 'employee', 'financial'

enable 'employee'

get 'employee','emp1',{COLUMN=>'financial',VERSIONS=>15}

That is HBase at 30,000 feet…