Repeating Numbers in R without Loop structures

Hello,

The other day, I needed to repeat values based on a second vector of counts; e.g., the vector a, b, c needed to be repeated according to the vector 3, 2, 1 so that the output was a, a, a, b, b, c.

I started my R code with “for”... then suddenly a booming voice echoed, “If you are using ‘for’ in R, you aren’t using it properly”. So I started thinking about how to do this without a loop;

First, the code to repeat values based on a vector of counts using a loop;

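base <- c("a", "b", "c")
repetition <- c(3, 2, 1)
# start from an empty vector so no dummy first element has to be dropped later
new <- character(0)
for (i in seq_along(repetition)) {
  # append base[i], repeated repetition[i] times
  new <- c(new, rep(base[i], repetition[i]))
}
new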

Finally, the code to repeat values without a loop. The code is much cleaner, shorter and sweeter;

base <- c("a", "b", "c")
repetition <- c(3, 2, 1)
# rep() is vectorized: each element of base is repeated by the matching count
rep_final <- rep(base, times = repetition)
rep_final

For similar cases where the gaps are already present as NA values, you can also use the na.locf function from the zoo library, which carries the last non-missing value forward. Say NO to Loops!!!!!!!
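As a small sketch of how the arguments of rep() differ (base R only; the variable names are just for illustration), the three common call shapes can be compared side by side:

```r
base <- c("a", "b", "c")
repetition <- c(3, 2, 1)

# One count per element: element i is repeated repetition[i] times
by_counts <- rep(base, times = repetition)   # "a" "a" "a" "b" "b" "c"

# Same count for every element, kept together
by_each <- rep(base, each = 2)               # "a" "a" "b" "b" "c" "c"

# Whole vector repeated end to end
by_cycle <- rep(base, times = 2)             # "a" "b" "c" "a" "b" "c"

# Index-based variant: repeat the positions, then subset
by_index <- base[rep(seq_along(base), repetition)]
```

The index-based variant is handy when the thing being repeated is a row of a data frame rather than a single value.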

HBase 101 and Tutorial

HBase is a component of the Hadoop ecosystem. It is a column-oriented NoSQL database that uses in-memory processing to add quick read and write capability to the write-once, read-many rigidity of Hadoop. Like other columnar databases, HBase uses a row identifier and column families.

The most basic gene of an RDBMS is the tuple. An RDBMS table is a collection of tuples. There is no identity below a tuple (cells on their own can’t be an entity; they are part of a tuple in an RDBMS). Because of this design principle an RDBMS table must have a fixed structure, and changing it means changing all the tuples. Columnar databases cleverly circumvent this by defining the basic gene as a cell. Each cell has its own identity and membership of a row in a table, and this gives the freedom to have different rows with different structures.
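To make the cell-as-basic-unit idea concrete, here is a toy sketch in plain R. It is not the HBase API, only an illustration; the row keys and column names are made up. Each cell is stored with its own row key and "family:qualifier" address, so different rows can carry different columns:

```r
# Toy model: one record per cell, addressed by (row key, "family:qualifier").
# This is plain R, not HBase; names and values are hypothetical.
cells <- data.frame(
  row    = c("row1", "row1", "row2"),
  column = c("cf1:col1", "cf1:col2", "cf1:col3"),
  value  = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Each row key can have its own set of columns
columns_by_row <- split(cells$column, cells$row)
columns_by_row$row1   # "cf1:col1" "cf1:col2"
columns_by_row$row2   # "cf1:col3"

# Adding a brand-new column touches one cell only; no other row changes
cells <- rbind(cells,
               data.frame(row = "row2", column = "cf1:col9", value = "d",
                          stringsAsFactors = FALSE))
```

In a fixed-schema table the same change would have meant adding a column to every tuple, most of them empty.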

Advantages of Columnar databases;

1. Very good for Sparse Data scenarios

2. Fewer tables are needed to describe an entity (all the RDBMS tables related to one entity can be merged into one single table)

3. Version history is stored by default, which gives built-in slowly changing dimension (SCD) capability.

Limitations of HBase;

1. Data retrieval is not SQL oriented. HBase uses get/put/scan type statements.

Below are some of the important commands to get started;

list --> Lists all the tables in the HBase DB.

create --> Used to create tables. We need to specify at least one column family, otherwise table creation is not allowed.

e.g., create 'tx', 'cf1'

describe --> Used to display the table’s schema details (its column families and their settings).

e.g., describe 'tx'

put --> Used to insert data into tables. We need to specify the row identifier so that the appropriate row is updated.

e.g., put 'tx', 'row1', 'cf1:col1', 'col1'

put 'tx', 'row1', 'cf1:col2', 'col2'

get --> Used for information retrieval from the table. We need to specify the row identifier when using get. We can also limit results by specifying the column family or column to choose.

e.g., get 'tx', 'row1', {COLUMN => ['cf1:col1']}

scan --> Used to display the contents (the rows and their cells) of the said table.

e.g., scan 'tx'

disable --> Used to disable the table. Altering a table isn’t allowed while it is enabled, so it should be disabled first and then altered. A table is not available for queries while it is disabled.

e.g., disable 'tx'

enable --> Used to enable a disabled table so that it is available for querying again.

e.g., enable 'tx'

drop --> Used to delete the said table. The table must be disabled before it can be dropped.

e.g., drop 'tx'

HBase Walkthrough;

Using the commands above, we have created a table 'tx'. Now let us do some operations on it;

Inserting/Updating rows into the table;

put 'tx', 'row1', 'cf1:col1', 'col1-'

put 'tx', 'row2', 'cf1:col1', 'col21'

In the command below we are adding a new column to the cf1 family at run time.

put 'tx', 'row2', 'cf1:col3', 'col31'

Retrieving data entirely/column specific from the table;

get 'tx', 'row2'

get 'tx', 'row1', {COLUMN => ['cf1:col1']}

Schema Reduction Example;

Using the classic EMPLOYEE example, consider three tables: EMPLOYEE, TEAM and FINANCIAL. OLTP systems and RDBMS-based OLAP systems define these three entities as three separate tables for ACID, storage and performance reasons. HBase, on the other hand, can build a single table with the employee as the single entity and the other entities as its properties (column families). We can add any number of columns on the fly to any of the column families. An RDBMS can also create one very large table, but it would then have the usual large-table problems: sparseness and a rigid schema.

create 'employee', {NAME => 'details', VERSIONS => 5}, {NAME => 'team', VERSIONS => 6}

put 'employee', 'emp1', 'details:name', 'emp_1'

put 'employee', 'emp1', 'details:id', 1

put 'employee', 'emp1', 'team:id', 10

put 'employee', 'emp1', 'team:name', 'coke'

put 'employee', 'emp1', 'team:name', 'pepsi'

get 'employee', 'emp1', {COLUMN => 'team', VERSIONS => 2}

disable 'employee'

alter 'employee', 'financial'

enable 'employee'

get 'employee', 'emp1', {COLUMN => 'financial', VERSIONS => 15}
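The VERSIONS setting used above can be pictured with a toy sketch in plain R. This is only an illustration of the idea (keep the newest N values of a cell), not how HBase is implemented; the put_version helper is hypothetical:

```r
# Toy model of one versioned cell; plain R, not the HBase API.
# put_version is a made-up helper for illustration.
put_version <- function(cell, value, max_versions) {
  cell <- c(value, cell)        # newest value goes first
  head(cell, max_versions)      # drop versions beyond the limit
}

team_name <- character(0)
team_name <- put_version(team_name, "coke",  max_versions = 6)
team_name <- put_version(team_name, "pepsi", max_versions = 6)

# Like asking for VERSIONS => 2: the two most recent values
latest_two <- head(team_name, 2)   # "pepsi" "coke"

# With a limit of 1, only the newest value survives
single <- put_version(put_version(character(0), "coke", 1), "pepsi", 1)
```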

that is Hbase at 30,000 feet…..

Text Mining – Data Preprocessing using R – Part 1

For text mining projects, the most important and time-consuming process is the data pre-processing phase. In this phase, you cleanse, validate and transform the training data for your machine learning algorithm. Improper pre-processing reduces accuracy and can lead to over-fitting or under-fitting (high variance or high bias) in the algorithms. All the pre-processing done for the training data must also be applied to the test data and to the incoming online data, so that the data reaches the algorithms in the same form. There are many steps in data pre-processing and not all of them are needed in every project. For example, spell checking may not be needed for machine-generated data (assuming the machine will not make mistakes), part-of-speech tagging may not be needed for document genre classification, and abbreviation handling may not be needed in certain cases. Below are some of the text pre-processing steps with relevant R code.

1. Data Loading;

Data can be loaded into R from many sources. Functions for reading files and accessing database tables are available in R libraries.

#Loading source data
source_data<-read.csv("source.csv",header=FALSE)

2. Text Normalization;

In the example of review classification, text like 3 stars, 5 stars, 2 star can be normalized to something like ‘good’, ‘bad’, ‘average’ before tokenization or building a bag of words. Another example is that words like ‘haven’t’, ‘shouldn’t’, etc. can be reduced to ‘not’. This reduces the token count and the complexity of handling such tokens. Tokenizing 2.5 stars would give you two tokens, 2.5 and stars, which may be interpreted wrongly by the machine learning algorithm.

source_data<-sapply(source_data,
function(x) sub("4 stars","good",x))
source_data<-sapply(source_data,
function(x) sub("1 star","bad",x))

# replacing words with n't
# not.txt is a file with words (one per line) that can be replaced with not
not<-read.csv("c:/not.txt",header=FALSE,sep="\n",stringsAsFactors=FALSE)
repNot <- function(x) {
  i<-1
  while(i<=length(not[,1]))
  {
    # replace every occurrence of the i-th listed word with "not"
    x<-gsub(not[i,1],"not",x,fixed=TRUE)
    i<-i+1
  }
  return (x)
}
source_data<-repNot(source_data)
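As a sketch of how the star-rating normalization could be generalized beyond one sub() call per rating, the snippet below pulls out the number in front of "star"/"stars" and maps it to a label. The sample reviews and the cut-off points (up to 2 is bad, up to 3.5 is average, above that is good) are assumptions for illustration only:

```r
# Hypothetical sample reviews
reviews <- c("Great phone, 5 stars", "Broke in a week, 1 star",
             "It is ok, 2.5 stars", "No rating given")

rating_pattern <- "[0-9]+(\\.[0-9]+)? stars?"

# Extract the numeric rating; texts without a rating become NA
stars <- suppressWarnings(as.numeric(
  sub(".*?([0-9]+(\\.[0-9]+)?) stars?.*", "\\1", reviews, perl = TRUE)))

# Assumed thresholds: (0,2] bad, (2,3.5] average, (3.5,5] good
label <- as.character(cut(stars, breaks = c(0, 2, 3.5, 5),
                          labels = c("bad", "average", "good")))

# Swap the rating phrase for its label, leaving unrated text untouched
normalized <- mapply(function(txt, lab) {
  if (is.na(lab)) txt else sub(rating_pattern, lab, txt)
}, reviews, label, USE.NAMES = FALSE)
normalized
```

Because the whole phrase "2.5 stars" is replaced before tokenization, the tokenizer never sees the stray tokens 2.5 and stars.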

3. Abbreviations;

Sometimes handling abbreviations makes sense: certain scenarios may need abbreviations expanded and others may need phrases abbreviated. Whatever the case, the snippet below can be used for both by simply swapping the abbreviation and expansion columns.

library(tm)

# setting up an abbreviation lookup
# swap the two columns for the reverse (abbreviating) logic
abbrev<-c("lol","omg","imo","gtfo")
abbrev<-cbind(abbrev,c("laugh out loud","oh my god",
"in my opinion","get the out"))
abbrev

# function for abbreviations
findAbbrev <- function(x) {
  i<-1
  while(i<=length(abbrev[,1]))
  {
    # gsub replaces every occurrence, not just the first one
    x<-gsub(abbrev[i,1],abbrev[i,2],x)
    i<-i+1
  }
  return (x)
}

text <- "Lol, OMG, IMO this is not a drill, haven't you heard"
text <- c(text,"LOL, Am I wrong, GTFO")
text<- Corpus(VectorSource(text),
readerControl=list(reader=readPlain))
# content_transformer() wraps plain character functions for use in tm_map
text <- tm_map(text , content_transformer(tolower))
text <- tm_map(text , removePunctuation)
text

# apply the abbreviation lookup to every document in the corpus
text <- tm_map(text , content_transformer(findAbbrev))
inspect(text)
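One thing to watch with plain substring replacement is that a short abbreviation can also match inside a longer word. A minimal base R sketch that guards against this with word boundaries is shown below; the function name expand_abbrev and the sample sentence are made up for illustration, and no tm corpus is involved:

```r
# Named vector: names are abbreviations, values are expansions
abbrev_lookup <- c(lol = "laugh out loud",
                   omg = "oh my god",
                   imo = "in my opinion")

# Hypothetical helper, not part of any package
expand_abbrev <- function(x, lookup) {
  for (short in names(lookup)) {
    # \\b keeps "lol" from matching inside words such as "lollipop"
    pattern <- paste0("\\b", short, "\\b")
    x <- gsub(pattern, lookup[[short]], x, ignore.case = TRUE)
  }
  x
}

expanded <- expand_abbrev("LOL, imo that lollipop was great", abbrev_lookup)
expanded
# "laugh out loud, in my opinion that lollipop was great"
```

Using ignore.case = TRUE also means the text does not have to be lower-cased first.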

to be continued…,

Building Hellinger Distance Decision Trees

If you have stumbled on this page, I assume there is still no Package in R for Hellinger Decision Trees.

When I wanted to build a model using a Hellinger tree, my natural instinct was to search for a package in R. When I found none, I searched for an implementation in any other language, but I couldn’t find any implementation snippets. So, this is my post that could save your valuable time.

So, using this jar, I updated the base Weka jar and found the Hellinger tree under the Classify tab in Weka Explorer, but somehow couldn’t work with it, maybe because I was new to Weka. I didn’t have the time to learn Weka and I needed to build the model quickly, using only a Hellinger distance decision tree. So I used the jar to build the tree in our old dependable friend Java.

Weka classifiers work best with .arff files. Data can be supplied to them in other formats, but arff files are native to Weka. We can use the RWeka package in R to export training, testing and other datasets as .arff files.

library(RWeka)
write.arff(data_train,"train.arff",eol = "\n")

Now for the sample Java code for a Hellinger tree based model;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.HTree;
import weka.core.Instance;
import weka.core.Instances;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.BufferedReader;
import java.util.Random;

public class hellingerDT {

    /**
     * Trains a Hellinger tree on train.arff, evaluates it on test.arff
     * and prints predictions for classify.arff.
     *
     * @param args not used
     * @throws Exception if a file cannot be read or the model cannot be built
     */
    public static void main(String[] args) throws Exception {

        //Reading data from training, test, classify arff files
        BufferedReader breader = new BufferedReader(new FileReader("location/train.arff"));
        Instances train = new Instances(breader);
        breader.close();
        //Setting the first column as the class variable, i.e. the attribute the model should predict
        train.setClassIndex(0);

        breader = new BufferedReader(new FileReader("location/test.arff"));
        Instances test = new Instances(breader);
        breader.close();
        test.setClassIndex(0);

        breader = new BufferedReader(new FileReader("location/classify.arff"));
        Instances classify = new Instances(breader);
        breader.close();
        classify.setClassIndex(0);

        //Instantiate a Hellinger Tree model
        HTree hT = new HTree();

        //Train the model
        hT.buildClassifier(train);

        //Evaluating a model, using test data
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(hT, test);
        //Display the metrics
        System.out.println(eval.toSummaryString("Review Classification Hellinger Tree", true));
        //Precision and recall for the class value at index 1
        System.out.println("Precision " + eval.precision(1) * 100 + " and Recall " + eval.recall(1) * 100);

        //Printing the Tree
        System.out.println(hT.graph());

        //Classifying New Data (prints the index of the predicted class value)
        for (int i = 0; i < classify.numInstances(); i++) {
            double pred = hT.classifyInstance(classify.instance(i));
            System.out.println(pred);
        }

    }

}

Monte Carlo Simulation in R

Let us try a complex probability calculation. The problem statement is: what is the probability of having at least one girl child in a family that may have 1, 2, 3 or 4 children (each family size being equally likely, and each child equally likely to be a boy or a girl)? This, as you can see, is a complex calculation for a non-mathematician, and as a simple being I would rather employ the Monte Carlo simulation technique. The idea behind the technique is that if you simulate a very large number of experiments, the observed frequency of an event approaches its true probability. So let’s do Monte Carlo for the above problem in R.

# Simulating 10000 families
a<-rep(1,10000)
# Simulating number of kids in each family
b<-sapply(a,function(x) sample(1:4,1))
# Simulating male and female kids in each family 1=female, 0=male
# (lapply keeps one vector per family; the name kids avoids masking c())
kids<-lapply(b, function(x) sample(0:1,x, replace = TRUE))
# Counting the families with at least one female kid
n_girl<-sum(sapply(kids,sum)>=1) # I got 7631
# probability is calculated
n_girl/10000 # 0.7631

That was it. The probability is ~0.7631.
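For this particular problem the simulated value can be cross-checked against an exact figure. The sketch below assumes the same setup as the simulation (family sizes 1 to 4 equally likely, each child a girl with probability 0.5); the seed and the number of simulated families are arbitrary choices:

```r
# Exact value: for n children, P(no girl) = 0.5^n, so P(at least one girl) = 1 - 0.5^n
n_children <- 1:4
exact <- mean(1 - 0.5 ^ n_children)   # average over the four equally likely sizes
exact                                  # 0.765625

# Vectorized simulation of the same experiment
set.seed(42)                                             # arbitrary seed
sizes <- sample(1:4, 100000, replace = TRUE)             # children per family
girls <- rbinom(length(sizes), size = sizes, prob = 0.5) # girls per family
estimate <- mean(girls >= 1)

abs(estimate - exact)   # gap between simulation and exact value
```

The gap shrinks as the number of simulated families grows, which is the whole point of the technique.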

Looping in R (without For or While statements)

Salutations Reader,

The other day I found a guy who made a statement, which was something like “If you are using ‘for’ in R, you aren’t using it properly”. At first, I was surprised by this. How do I write loops without a for? How can I do Monte Carlo simulations without a for statement? Then I got to thinking about what the guy meant. R is good at vector manipulation. Can I use this to avoid loops? Let me show this with a Monte Carlo simulation to calculate the probability of a head in a coin toss (assumption: the coin is unbiased and there is justice in the universe).

Flipping 100 coins;

a<-rep(1,100)
a<-sapply(a,function(x) sample(0:1,x))

probability<-sum(a)/100 # I got 0.51

Flipping 100000 coins;

b<-rep(1,100000)
b<-sapply(b,function(x) sample(0:1,x))

probability<-sum(b)/100000 # I got 0.499

Now I understand why the guy said what he said. Now Go forth and simulate!!!!
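As a further sketch, the same experiment can be written with no apply call at all, because sample() can draw every flip in one go. The seed and the checkpoints below are arbitrary choices:

```r
set.seed(1)                                      # arbitrary seed
n <- 100000
flips <- sample(0:1, n, replace = TRUE)          # 1 = head, 0 = tail

p_head <- mean(flips)                            # estimated probability of a head

# Running estimate after each flip, to watch it settle near 0.5
running <- cumsum(flips) / seq_along(flips)
running[c(100, 1000, 10000, 100000)]
```

Looking at the running estimate at a few checkpoints shows the same thing as the 100 versus 100000 comparison above: more flips, less wobble.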

Resolving Infinite Hierarchy levels using Recursive Queries

Typically, most reporting tools do not support recursive querying. This is critical for scenarios where you need to traverse a table multiple times at run time. One simple scenario is shown below;

Let us take Adam’s family tree as an example;

Any OLTP application built to display this data will probably be storing it in the following way;

Why? To save space and improve performance.

create table parent_child (parent varchar(max), child varchar(max));

insert into parent_child values('Adam','Abel');
insert into parent_child values('Adam','Cain');
insert into parent_child values('Adam','Seth');
insert into parent_child values('Cain','Enoch');
insert into parent_child values('Enoch','Irad');
insert into parent_child values('Irad','Mehujael');
insert into parent_child values('Seth','Enos');
insert into parent_child values('Enos','Cainan');
insert into parent_child values('Cainan','Mahaleel');
insert into parent_child values('Mahaleel','Jared');

select * from parent_child;

But the users may need a report showing the ancestry details and the generation level, which the application may produce using data structures, arrays and run time coding. How can a reporting tool do this? You can create specialized tables, but can you support infinite levels? Not all entities will have the same number of levels, which leads to nulls in columns.

Solution: Databases support recursive queries through recursive common table expressions (CTEs). Most of them follow the standard ANSI syntax, which means largely the same code can run on multiple databases. Some databases, such as PostgreSQL and MySQL, require the keyword WITH RECURSIVE, while SQL Server uses plain WITH.

The Ancestry Details can now be found as;

WITH cte_name (Ancestor, Descendent, lvl)
AS
(
    -- anchor member: every direct parent/child pair is level 1
    select PARENT, CHILD, 1 as lvl
    from parent_child
    UNION ALL
    -- recursive member: extend each known descendent by one more generation
    select cte_name.Ancestor, parent_child.CHILD, lvl + 1
    from parent_child
    inner join cte_name on cte_name.Descendent = parent_child.PARENT
)
SELECT *
FROM cte_name
order by Ancestor, lvl

This is only one type of scenario that has been solved. We can use the Recursive queries to solve other run time scenarios as well.
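To see what the recursive query is doing step by step, the same anchor-plus-repeated-join logic can be sketched in R on the family tree data. This is only an illustration of the idea, not a replacement for the SQL; the variable names are made up:

```r
edges <- data.frame(
  parent = c("Adam", "Adam", "Adam", "Cain", "Enoch", "Irad",
             "Seth", "Enos", "Cainan", "Mahaleel"),
  child  = c("Abel", "Cain", "Seth", "Enoch", "Irad", "Mehujael",
             "Enos", "Cainan", "Mahaleel", "Jared"),
  stringsAsFactors = FALSE
)

# Anchor step: every direct parent/child pair is level 1
frontier <- data.frame(ancestor = edges$parent, descendant = edges$child,
                       lvl = 1, stringsAsFactors = FALSE)
closure <- frontier

# Recursive step: join the newest rows back to the edge table
# until no further generation is found
while (nrow(frontier) > 0) {
  step <- merge(frontier, edges, by.x = "descendant", by.y = "parent")
  frontier <- data.frame(ancestor = step$ancestor, descendant = step$child,
                         lvl = step$lvl + 1, stringsAsFactors = FALSE)
  closure <- rbind(closure, frontier)
}

closure <- closure[order(closure$ancestor, closure$lvl), ]
nrow(closure)   # 26 ancestor/descendant pairs
```

For this data the loop stops after five generations (Adam to Jared), which is the deepest path in the tree.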

The First Domino…(BVTA part-II)

The first rays of the day touch the asylum’s outer walls. The edifice feels uncomfortable and is unwelcoming to the light. A single streak of light traverses the office floor to reach Jeremaiah’s hand. He realizes the time. He had not slept a wink for the past few weeks. He had promised he would avenge his uncle’s death by ridding the asylum of madness, and he toils for the cause. The madness that overtook his uncle, leading to his death, brought him to a point where he loathed humanity. He explored every means to cure the patients… drugs, abandoned experiments, controversial procedures like ludiwigo which often ended up harming the patients to the point of no return. Deaths turned into scandals but no one could stop him. Politicians didn’t want the looneys on their streets and only one person would take them. After he fell, his nephew took up the mantle.

Jeremaiah walks to the window. The inmates come out into the yard for their morning walk. One look at them and feelings of disgust, sadness and sympathy overcome him. His helplessness is evident in his face. “Why can’t I cure them?” he asks himself. He looks at Amadeus’ portrait. Suddenly the phone beeps. “Sir, there’s a situation down here.”

Jeremaiah opens the door and is greeted with flashes and microphones. “Doctor, how do you feel about moving the patients?”. Jeremaiah feels clueless. He sees Shondra talking to the reporters. “What is going on here, Ms. Shondra?”. Shondra replies, “It’s Doctor Shondra. I am here to shift some of the patients to my facility. It has come to light that the asylum is not a proper facility to treat these patients.”. “What!!?? Do you know this asylum has been treating patients since before you were born? I won’t let you do this.” “You can’t stop me, I have the Mayor’s authorization. We need you to co-operate with us.”. Jeremaiah reads the letter. There was nothing he could do. He motions to his crew. “Here is the list of patients I am taking,” Shondra says. Jeremaiah shouts, “Not him, you can’t handle him. He’s a psychopath. He’s too violent. No one can cure him. He is beyond hope.”. Shondra says, “Then why didn’t you put him on the chair till now?”.

The patients are lined up, restrained and taken. They start exiting the asylum and walk towards the bus. One of them moves slowly compared to the others. Jeremaiah eyes him. The inmate smiles. He looks at one of the guards. “Bob…Bob, Is that you? I will take care of your mother.”. Bob is puzzled, “but she’s in New York, And he’s just moving to another part of this city.”. Silently, Jeremaiah retreats into his office.

The bus rolls out of the asylum. Three convoys escort the bus. Thirty minutes pass as the bus moves onto a deserted highway.  “BOOM !!!!” The three convoys explode in sync. All the inmates gasp. The guards in the bus turn off their safeties, and one of them walks to the rear end of the bus. One more explosion: the driver’s cabin is shredded and the blast tears the bus apart. Only one guard survives. He hobbles up to Kerr. Points his gun towards him…….

Bad Comedy…….(BVTA Part-1)

It was always dark in the asylum but tonight was something else altogether. The night wore its cloak of darkness as if it had a purpose: giving birth to innumerable horrors.

Two guards on the night shift talk to each other. “Man, New York was messed up bad. My mother’s stuck there. I donno what to do. Do you have anyone who can get her outta there? You think things are gonna get better?”. The other replies, “No man, ain’t got no one in there. I hope she makes it through. That place is only gonna get worse. With the army and what’s left of the P.D stuck with the rehabilitation work;  they can’t take the gangs. I heard new ones are popping up like mushrooms.”.

A spine chilling laughter cuts through the cold, dank, drug odored air. “Heeeeeenhihhiheeeeeeee .. cough ..cough”. “Bob….Bob?, is that you. I know your mother. That old dog. She is in great danger my friend. One look at her and they would think one of those aliens survived. They would probably turn her in for autopsy”. Bob retorts “Shut up you old coot”.

The old man in the isolated cell replies “Did I offend you? Seems I have gone bad at my game.”. The old man says, “The docs are pumping me these drugs. They have weakened my funny bone.”. “Still not doing it for you eh… Bob.” Bob has had enough, “He’s done it again. He’s thrown up his sedatives. I’ll have to shut him up the old way”. He enters the old man’s cell. “Bob, you are here…. Lets play a game.”. Bob raises his baton, “No more games. Time to sleep.” ” I can save your mother. I can get her out. Do this for me. Get this letter to Dr. Shondra K. She can help me. Help me help you.” “Are you kidding me? you better not be.”. The old man says, “You have got no other choice Bob, Mother can’t be shipped to Roswell in a bag…hinssssssshsssss”. Bob lashes one final shot at the old man and quips, “By god, if you play me….”, The old man hits the floor like a rag doll and bleeds. Bob leaves, crumpled note in hand.

10 Hours later,

Beep.Beep….Beep.Beep. A phone rings in an office. “Dr. Shondra, you have a visitor from the asylum.”.”Give me five minutes.”. “Ok, Doctor.”. The receptionist asks Bob to wait. His mother’s image still circling his mind. He feels nervous. He thinks, “What if she thinks this is big joke…What if she doesn’t take it seriously. The letter is going to make news. Big news, Bad news. If taken seriously.”. “The Doctor will meet you now.”. Bob saw Shondra and suddenly he felt at ease, her face, angelic smile instantly calmed him down. For a few seconds, he could forget his predicament. She asks, “How is Jeremiah?”. “Well, he’s busy.”. She asks, “What is he busy with?”. He said, “The usual stuff.”. “I was afraid you would say that.”, She says with a concern on her face. She never liked what Jeremiah did. He was the opposite of what she stood for. Bob handed her the letter, he had nothing to say. He was a pawn in a grand scheme of things that he had no idea of. All he wanted was his small wish. The letter was crumpled, the writing on it was legible but clearly felt like a whole mob was trying to write it, fighting while they were at it. She heard the door slam behind her. The words of the letter speak……

“Hi Doc, howwww arrrre yoouuuu!! apppoologiies forrr thhhhe slurriness, Drugged as I am. Things here are terrible, went bad to worse. Ain’t no place for an old man. drugs, experiments can’t handle em doc. What year is this!!!! get me outta hereeeeeee or else I will go MAD

Mr. Kerr”
