Text Mining – Data Preprocessing using R – Part 1

For text mining projects, the most important and time-consuming process is the data pre-processing phase. In this phase, you cleanse, validate and transform the training data for your machine learning algorithm. Improper pre-processing of data leads to reduced accuracy, over-fitting/under-fitting and bias/variance problems in the algorithms. All the pre-processing done on the training data must be done to the test data and the incoming online data as well, to make sure the data reaches the algorithm in the same form. There are many steps in data pre-processing and not all may be needed in every project. For e.g., spell checking may not be needed for machine-generated data (assuming the machine will not make mistakes), parts-of-speech tagging may not be needed for document genre classification, and abbreviation handling may not be needed in certain cases. Below are some of the text pre-processing steps with relevant R code.

1. Data Loading;

Data can be loaded into R from many sources. Functions for reading files and accessing database tables are available in R libraries.

#Loading source data
source_data<-read.csv("source.csv",header=FALSE,stringsAsFactors=FALSE)

2. Text Normalization;

In the example of review classification, text like '3 stars', '5 stars', '2 star' can be normalized to labels like 'good', 'bad', 'average' before tokenization or bag-of-words. Another example: words like 'haven't', 'shouldn't' etc. can be reduced to 'not'. This reduces the token count and the complexity of handling such tokens. Tokenizing '2.5 stars' would give you two tokens, '2.5' and 'stars', which may be interpreted wrongly by the machine learning algorithm.

source_data<-sapply(source_data,
                    function(x) sub("4 stars","good",x))
source_data<-sapply(source_data,
                    function(x) sub("1 star","bad",x))

# replacing words with n't
# not.txt is a file with words that can be replaced with not
not<-read.csv("c:/not.txt",header=FALSE,sep="\n",stringsAsFactors=FALSE)
repNot <- function(x) {
  i<-1
  while(i<=length(not[,1]))
  {
    x<-sub(not[i,1],"not",x)
    i<-i+1
  }
  return (x)
}
source_data<-repNot(source_data)
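As an aside, the same replacement can be done without an explicit loop by collapsing the word list into a single regular-expression alternation. A minimal sketch, with the negation words hard-coded here instead of read from not.txt (and assuming none of them contain regex metacharacters):

```r
# hypothetical sample of n't words; in practice these would come from not.txt
not_words <- c("havent", "haven't", "shouldnt", "shouldn't", "wouldnt")

# collapse the list into one alternation pattern and replace in a single pass
pattern <- paste(not_words, collapse = "|")
replaceNot <- function(x) gsub(pattern, "not", x)

replaceNot("I havent seen it and I shouldnt")  # "I not seen it and I not"
```

Since gsub is vectorized, replaceNot works on a whole column of text at once.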

3. Abbreviations;

Sometimes handling abbreviations makes sense: certain scenarios need expansion of abbreviations and some need creation of abbreviations. Whatever the case, the snippet below can be used by just swapping the abbreviation and expansion columns.

# setting up an abbreviation lookup
# reverse the vector for expansion logic
abbrev<-c("lol","omg","imo","gtfo")
abbrev<-cbind(abbrev,c("laugh out loud","oh my god",
"in my opinion","get the out"))
abbrev

# function for abbreviations
findAbbrev <- function(x) {
  i<-1
  while(i<=length(abbrev[,1]))
  {
    x<-sub(abbrev[i,1],abbrev[i,2],x)
    i<-i+1
  }
  return (x)
}

library(tm)

text <- "Lol, OMG, IMO this is not a drill, haven't you heard"
text <- rbind(text,"LOL, Am I wrong, GTFO")
text <- Corpus(VectorSource(text),
               readerControl=list(reader=readPlain))
text <- tm_map(text, content_transformer(tolower))
text <- tm_map(text, removePunctuation)
text

# apply the abbreviation expansion to each document in the corpus
text<-tm_map(text, content_transformer(findAbbrev))
text

to be continued…

Building Hellinger Distance Decision Trees

If you have stumbled onto this page, I assume there is still no package in R for Hellinger decision trees.

When I wanted to build a model using a Hellinger tree, my natural instinct was to search for a package in R. When I found none, I searched for an implementation in any other language, but couldn't find any implementation snippets either. So, this is my post that could save your valuable time.

So, using this jar, I updated the base Weka jar and found the Hellinger tree under the Classify tab in Weka Explorer, but couldn't get it to work somehow, maybe because I was new to Weka. I didn't have the time to learn Weka, and I needed to build the model ASAP using only the Hellinger distance decision tree. So I used the jar to build the tree in our old dependable friend, Java.
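For intuition, the split criterion this tree uses in place of information gain (which is what makes it robust to class imbalance) is the Hellinger distance between two discrete distributions, and it has a very simple form. A minimal sketch in R, with the helper name hellinger being my own:

```r
# Hellinger distance between two discrete probability vectors p and q
hellinger <- function(p, q) sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2)

hellinger(c(0.5, 0.5), c(0.5, 0.5))  # 0: identical distributions
hellinger(c(1, 0), c(0, 1))          # 1: completely disjoint distributions
```

The distance is bounded between 0 and 1, so candidate splits can be compared on a common scale regardless of how skewed the class distribution is.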

Weka classifiers work best with .arff files. Data can be supplied to them in other formats, but arff files are native to Weka. We can use the RWeka package in R to export training, test and other datasets as .arff files.

library(RWeka)
write.arff(data_train,"train.arff",eol = "\n")

Now for the sample Java code for a Hellinger tree based model;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.HTree;
import weka.core.Instance;
import weka.core.Instances;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.BufferedReader;
import java.util.Random;

public class hellingerDT {

    /**
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {

        // Reading data from the training, test and classify arff files
        BufferedReader breader = new BufferedReader(new FileReader("location/train.arff"));
        Instances train = new Instances(breader);
        // Setting the first column as the class variable the model predicts
        train.setClassIndex(0);
        breader.close();

        breader = new BufferedReader(new FileReader("location/test.arff"));
        Instances test = new Instances(breader);
        test.setClassIndex(0);
        breader.close();

        breader = new BufferedReader(new FileReader("location/classify.arff"));
        Instances classify = new Instances(breader);
        classify.setClassIndex(0);
        breader.close();

        // Instantiate a Hellinger tree model
        HTree hT = new HTree();

        // Train the model
        hT.buildClassifier(train);

        // Evaluate the model using the test data
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(hT, test);
        // Display the metrics
        System.out.println(eval.toSummaryString("Review Classification Hellinger Tree", true));
        System.out.println("Precision " + eval.precision(1) * 100 + " and Recall " + eval.recall(1) * 100);

        // Printing the tree
        System.out.println(hT.graph());

        // Classifying new data
        for (int i = 0; i < classify.numInstances(); i++) {
            double pred = hT.classifyInstance(classify.instance(i));
            System.out.println(pred);
        }
    }
}

MonteCarlo Simulation in R

Let us try a complex probability calculation. The problem statement: what is the probability of having at least one girl child in a family which may have 1, 2, 3 or 4 children (each count equally likely)? This, as you see, is a complex calculation for a non-mathematician, and as a simple being I would rather employ the Monte Carlo simulation technique. What this technique states is that if you simulate a very large number of experiments, the observed frequency of an event approaches its true probability. So let's do Monte Carlo for the above problem in R.

# Simulating 10000 families
a<-rep(1,10000)
# Simulating the number of kids in each family (1 to 4, equally likely)
b<-sapply(a,function(x) sample(1:4,1))
# Simulating male and female kids in each family: 1=female, 0=male
kids<-sapply(b, function(x) sample(0:1,x, replace = TRUE))
# Counting families with at least one female kid
length(which(sapply(kids,sum)>=1)) # I got 7631
# probability is calculated
7631/10000 # 0.7631

That was it. The probability is ~0.7631.
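As a sanity check, this particular problem also has a closed form: a family with n kids has probability 1 - 0.5^n of at least one girl, and averaging over the four equally likely family sizes gives the exact answer the simulation is approaching:

```r
# exact probability: average P(at least one girl | n kids) over n = 1..4
exact <- mean(1 - 0.5^(1:4))
exact # 0.765625
```

The simulated ~0.7631 is within sampling error of this exact value.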

Looping in R (without For or While statements)

Salutations Reader,

The other day I came across a guy who made a statement, something like "If you are using 'for' in R you aren't using it properly". At first I was surprised by this. How do I write loops without a for? How can I do Monte Carlo simulations without a for statement? Then I got to thinking about what the guy meant. R is good at vector manipulation. Can I use this to avoid loops? Let me show this with a Monte Carlo simulation to calculate the probability of a head in a coin toss (assumption: the coin is unbiased and there is justice in the universe).

Flipping 100 coins;

a<-rep(1,100)
a<-sapply(a,function(x) sample(0:1,x))

probability<-sum(a)/100 # I got 0.51

Flipping 100000 coins;

b<-rep(1,100000)
b<-sapply(b,function(x) sample(0:1,x))

probability<-sum(b)/100000 # I got 0.499
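Strictly speaking, the sapply above is still a loop in disguise; sample is itself vectorized, so the whole simulation can be a single call with no loop at all:

```r
# draw all 100000 tosses in one vectorized call
tosses <- sample(0:1, 100000, replace = TRUE)
mean(tosses) # should be very close to 0.5
```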

Now I understand why the guy said what he said. Now Go forth and simulate!!!!

Resolving Infinite Hierarchy levels using Recursive Queries

Typically, most reporting tools do not support recursive querying. This is critical for scenarios where you need to traverse a table multiple times at run time. One simple scenario is shown below;

Let us take Adam’s Family Tree example;

Any O.L.T.P application built to display this data will probably store it in the following way;

Why? To save space and improve performance.

create table parent_child (parent varchar(max), child varchar(max))

insert into parent_child values('Adam','Abel');
insert into parent_child values('Adam','Cain');
insert into parent_child values('Adam','Seth');
insert into parent_child values('Cain','Enoch');
insert into parent_child values('Enoch','Irad');
insert into parent_child values('Irad','Mehujael');
insert into parent_child values('Seth','Enos');
insert into parent_child values('Enos','Cainan');
insert into parent_child values('Cainan','Mahaleel');
insert into parent_child values('Mahaleel','Jared');

select * from parent_child;

But the users may need a report showing the ancestry details and the generation level, which the application may produce using data structures, arrays and run-time coding. How can a reporting tool do this? You can create specialized tables, but can you support infinite levels? Not all table entities will have the same number of levels, leading to nulls in columns.

Solution: databases support recursive queries, and most of them follow the standard ANSI syntax, which means one piece of code can run on multiple databases.

The Ancestry Details can now be found as;

WITH cte_name (Ancestor, Descendent, lvl)
AS
(
    select PARENT, CHILD, 1 as lvl
    from parent_child
    UNION ALL
    select cte_name.Ancestor, parent_child.CHILD, lvl+1
    from parent_child
    inner join cte_name on cte_name.Descendent = parent_child.PARENT
)
SELECT *
FROM cte_name
order by Ancestor, lvl;

This is only one type of scenario that recursive queries solve; they can be used for other run-time scenarios as well.

BO Universe Prompt doesn’t filter Multi byte Characters – Crystal Reports

A lot of times we create Crystal Reports on top of Business Objects universes. We use condition objects with @prompt to filter data on particular columns. Sometimes the database holds multi-byte characters, like Japanese or Chinese, which don't get filtered properly unless the IsUnicode radio button is enabled.

So for reports built for multiple languages, this radio button is pretty much mandatory. Also enable the IsUnicode property in the universe parameters to be on the safe side; this appends an 'N' before fields of the nvarchar datatype.


DB Session Management

Session management is very important for database administrators. Hosting a database is not always fun, as the people accessing it can do all kinds of stuff to it. These sessions sometimes need to be killed for the greater good. Below is session management for a few databases.

For Oracle Database Session Management;

Make sure you have admin credentials. Use the V$SESSION view for session management in an Oracle database.

select * from v$session;

or, more specifically, to avoid unnecessarily complicated output;

select sid,serial#,username,machine,program from v$session;

Now to kill an unwanted session. Using the SID and SERIAL# of the session, you can kill it as below;

alter system kill session 'SID,Serial#';

e.g., alter system kill session '13,10196';

For MS SQL Server 200x;

SQL Server has the simplest session monitoring there is. Connect to the DB server from SQL Server Management Studio, right click on the server and select "Activity Monitor". In the resulting window, expand the "Processes" pane to see the running sessions. Right click on the pesky session and select "Kill Process".

I will add other Databases session management info soon….

Datawarehousing for NewBIes – Part 2

So now that you have all the data in your warehouse, you do not know how to connect data from one department to another. For e.g., the Sales team sold your products but the income is handled by the Finance team; your inventory is managed by the Operations team but all the money required for the process is handled by the Finance team, etc. Now you have to find the common points to link your data from these different points.

To do this you can build an Entity Relationship Diagram depicting the entities in your organization, entities being the various parts of the organization.

Entity Relationship Diagrams

Here you understand what the IDs are in each department and whether these IDs are replicated in other departments for tracking. If they are, then you are in luck; otherwise you need to get ready for some serious remodeling of the organization's operations.

These IDs or unique identifiers for your entity can be called Dimensions. Dimensions are your concrete Entities.

For e.g., in your Sales department, your dimensions would be Customers who buy products, Products which are sold, Employees in the Sales department, etc. The most layman way to find out whether an entity can be a dimension is whether it has qualities of its own. Customers can have contact information, a type, etc. Anything that can hold its own existence in the warehouse can be put under a dimension. For e.g., orders placed by a customer cannot be dimensions, as they are transactions and do not have an existence unless a dimension (Customer) creates one.

Once you have your dimensions you can correlate and consolidate dimensions from various data sources/departments. For instance, you can create a consolidated dimension called Employees from the Employees dimensions of the different departments.

Now that you understand how your data resides in the warehouse, you decide what your data warehouse boundaries can be and how your departments want to share their data. This leads to a very important topic: Data marts… Dun Dun Daaan.

More on it in the next post. Same place and same channel……..


Datawarehousing for NewBIes – Part 1

Before you say "I see what you did there", this post is aimed at all the newcomers to data warehousing and business intelligence technologies. Most of us learn the tools first and only then get to understand what data warehousing and business intelligence are in practice, when frankly it should be the other way around. This post is like Data Warehousing for Dummies, of which I was one.

First off, the definitions;

Data Warehouse: A typically humungous database where you maintain your data. It can hold anything from historical data, like your sales from the past 10 years, to everyday operational data, like the number of hits on your website by different people. In layman's terms, it is a big database where you store all your organization's relevant data.

Data Warehousing: It involves creating efficient data warehouses so that their users can benefit from them by getting their questions answered.

Typical scenario; your client has a huge organization with many departments like Finance, Operations, Service, Sales, etc. The client has a board of members who make decisions and need structured data to make them. If you hand them the databases of the different departments they would go insane; if you give them separate reports from each department they would need additional time to correlate the data.

This is when you step up and say “I will build you a data warehouse and by the power of the data warehouse you will make decisions”.

Your journey begins as below:

You go to each department and understand what the hell they do and where they store their data. After a lot of meetings and an unending supply of coffee, you have all the inputs you need.

You understand that you can take all the departments’ data and just dump it in a single place to start off. You use various data extraction and cleansing tools and get the data into one place.

Now you see that the data still has no connection. You start meeting again with the departments to understand how they correlate and collaborate their data with the other departments.

  (…………………………..To be continued)

Business Objects Office Prank

Hey there,

Ever wanted to write a hate mail to your boss, a letter to that special cubicle mate, or just pull a plain old prank on your teammates under total anonymity? Who doesn't, right… If you have Business Objects installed on any server (not your workstation…) it's almost impossible to track the perpetrator unless you have friends on the mail admin team. Follow the steps below to have fun;

1. Log in to the Business Objects CMS of the server with the common user id.

2. Right click on a report, select Schedule, and set the recurrence properties; select hourly and every X minutes to really irritate.

3. Now for the important part: select the Destination option and in the Destination box select Email. Deselect "Use Default Settings" and you will be presented with the options to frame the mail. You can even set the "From" email id to anything you want.

4. Hit the Schedule button and avoid giggling every time someone mentions Business Objects.

This prank was played on my teammate and we have yet to find the prankster……