Time Savers

This article is a running list of time-saving code snippets:

1. List all the files in a folder using R:


file.list <- list.files('C:/Users/MyName/Desktop/')
write.table(file.list,'C:/Users/MyName/Desktop/files.txt')

2. Create a batch of variables without any manual intervention:

mylist <- 1:10
for (i in mylist) {
  assign(paste0('movie', i), i + 10)  # creates movie1 = 11, movie2 = 12, ...
}

I will keep adding more.

Unzipping a huge number of files using R

Hi,

Assume you have this situation:

[Screenshot: a folder full of zip files]

There are 400 such zip files and you want to unzip them. Imagine you are in an R session and do not want to exit R just for this. How do we achieve this? Can we do it quickly and efficiently? Yes, we can.

A combo of the foreach, doSNOW/doMC, and utils packages will do the trick.

rm(list=ls())

# setting up working directory
setwd('/root/Desktop/multi-modal-train')

# reading list of all zip files
zip.files <- list.files('/root/Desktop/multi-modal-train', pattern = '\\.zip$')
zip.files

# loading required libraries
library(foreach)
# Windows fork
# library(doSNOW)
# c1 <- makeCluster(1)
# registerDoSNOW(c1)

# linux fork
library(doMC)
registerDoMC()

# setting operation variables
base.dir <- '/root/Desktop/multi-modal-train'
i <- 1

# parallel loops for faster processing
# unzipping files can proceed in parallel
foreach(i = seq_along(zip.files)) %dopar% {
  unzip(paste(base.dir, zip.files[i], sep = '/'), exdir = 'unzipped')
  cat(paste('unzipping', zip.files[i], sep = '-'))
  gc(reset = TRUE)
}

# Windows fork
# stopCluster(c1)

The output is below:

[Screenshot: console output from the unzip run]
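For comparison, the same fan-out pattern can be sketched in Python with the standard library; the paths and function names here are illustrative, not part of the R workflow above:

```python
import glob
import os
import zipfile
from concurrent.futures import ThreadPoolExecutor

def unzip_one(path, out_dir):
    """Extract a single archive into out_dir."""
    with zipfile.ZipFile(path) as zf:
        zf.extractall(out_dir)
    return path

def unzip_all(folder, out_dir='unzipped'):
    """Unzip every *.zip in `folder` concurrently."""
    zips = sorted(glob.glob(os.path.join(folder, '*.zip')))
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda p: unzip_one(p, out_dir), zips))
```

Threads work well here because unzipping is mostly disk-bound, much as the forked R workers are in the doMC version.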

Happy R’ing

How to convert a pixel map to a picture – Java

Hi,

As part of the Kaggle challenge, I have had to visualize training data that is just a bunch of numbers giving pixel intensities. To convert this into a picture file such as JPEG or BMP I needed a converter. I found some Java code online, but it didn't solve my problem; I had to tweak it a bit to make it work for my example.

The input to the program is a pixel map like this;

[Image: pixel map, one intensity value per pixel]

To convert this into a picture, you can use the code below.

Note: my pixel map has 2304 pixel intensities, which translates to a 48*48 pixel image, and the code below is hard-coded for that size. For other dimensions, update the constants accordingly.

import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;
import java.io.File;
import javax.imageio.ImageIO;

public class pixtoImage {

    /**
     * Reads 48*48 grayscale intensities from the command line
     * and writes them out as a PNG.
     * @param args 2304 pixel intensity values
     */
    public static void main(String[] args) {
        File imageFile = new File("D:/Pics/0-6.png");
        // 3 bands in TYPE_INT_RGB
        int NUM_BANDS = 3;
        int[] pixelMap = new int[48 * 48 * NUM_BANDS];
        // copy each grayscale intensity into all three RGB bands
        for (int i = 0; i < 48; i++) {
            for (int j = 0; j < 48; j++) {
                for (int band = 0; band < NUM_BANDS; band++) {
                    pixelMap[((i * 48) + j) * NUM_BANDS + band] =
                            Integer.parseInt(args[(i * 48) + j]);
                }
            }
        }
        BufferedImage picImage = getImageFromArray(pixelMap, 48, 48);
        try {
            ImageIO.write(picImage, "png", imageFile);
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("Written");
    }

    public static BufferedImage getImageFromArray(int[] pixels, int width, int height) {
        BufferedImage image =
                new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        // getData() returns a copy of the raster; write into it, then set it back
        WritableRaster raster = (WritableRaster) image.getData();
        raster.setPixels(0, 0, width, height, pixels);
        image.setData(raster);
        return image;
    }
}

The output for the pixel map shown in the image above is

[Image: the resulting 48*48 picture]

May this snippet save you some time. Have fun!

R SMOTE Function – Reminder

SMOTE is a wonderful function in R. It belongs to the DMwR package and does something very useful: it helps fill out under-represented classes. If your data is imbalanced, SMOTE generates more examples of the minority class by interpolating between nearest neighbours, with the number of neighbours given as an argument. There is one catch, though: it cannot work with multi-class data, only with two classes. You will need some special data transformations, such as running it one class against the rest at a time, to SMOTE multi-class data.

Try testing your multi-class data with the syntax below:

train_cl1s <- SMOTE(V1 ~ ., train_cl1, perc.over = 1500, perc.under = 0, k = 11)

Now check the resulting data frame: with perc.under = 0 it will consist of only one class, because SMOTE recognizes just two classes and drops the rest.
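For intuition, here is a minimal Python sketch of the interpolation step SMOTE performs; the function name, signature, and parameters below are illustrative, not the DMwR API:

```python
import random

def smote_like(minority, n_new, k=5, seed=0):
    """Generate synthetic minority-class points by interpolating between
    a sample and one of its k nearest minority neighbours - the core
    idea behind SMOTE (names and signature are illustrative)."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the minority class only
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: dist2(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, nb)))
    return synthetic
```

Each synthetic point lies on the line segment between a minority sample and one of its neighbours, which is why SMOTE needs a single well-defined minority class to work on.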

Hbase 101 and Tutorial

HBase is a Hadoop ecosystem component: a column-oriented (NoSQL) database that uses in-memory processing to add some quick-read-and-write capability to the write-once-read-many rigidity of Hadoop. Like any other columnar database, HBase uses a row identifier and column families.

The most basic unit of an RDBMS is the tuple; an RDBMS table is a collection of tuples. There is no identity below a tuple (cells on their own are not entities; they exist only as part of a tuple). Because of this design principle, an RDBMS table must have a fixed structure, and changing it means updating all the tuples. Columnar databases cleverly circumvent this by making the cell the basic unit. Each cell has its own identity and a membership in a row of a table, which grants the freedom to have different rows with different structures.
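To make the cell-as-basic-unit idea concrete, here is a toy Python model of it; the class and method names are invented for illustration, and this is in no way how HBase is actually implemented:

```python
import itertools
from collections import defaultdict

class TinyColumnStore:
    """Toy model of a column-family store: every cell is addressed by
    (row key, 'family:qualifier') and keeps its own version history,
    so rows need not share any structure."""
    def __init__(self):
        self._cells = defaultdict(list)  # (row, column) -> [(version, value), ...]
        self._clock = itertools.count(1)

    def put(self, row, column, value):
        # appending (never overwriting) is what gives free version history
        self._cells[(row, column)].append((next(self._clock), value))

    def get(self, row, column, versions=1):
        history = self._cells[(row, column)]
        return [value for _, value in history[-versions:]]

store = TinyColumnStore()
store.put('emp1', 'team:name', 'coke')
store.put('emp1', 'team:name', 'pepsi')
store.put('emp2', 'details:phone', '555-0100')  # emp2 has a column emp1 lacks
print(store.get('emp1', 'team:name', versions=2))  # ['coke', 'pepsi']
```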

Advantages of columnar databases:

1. Very good for sparse data scenarios.

2. Fewer tables are needed to describe an entity (all the RDBMS tables related to one entity can be merged into a single table).

3. Version history is stored by default, giving inherent SCD (slowly changing dimension) capability.

Limitations of HBase:

1. Information retrieval is not SQL-oriented; it uses get/put/scan-style statements.

Below are some of the important commands to get started:

list –> Lists all the tables in the HBase DB

create –> Used to create tables. At least one column family must be specified, otherwise table creation is not allowed.

e.g., create 'tx', 'cf1'

describe –> Used to display table stats.

e.g., describe 'tx'

put –> Used to insert data into tables. Need to specify the row identifier so that the appropriate row is updated.

e.g., put 'tx', 'row1', 'cf1:col1', 'col1'

put 'tx', 'row1', 'cf1:col2', 'col2'

get –> Used for information retrieval from the table. We need to specify the row identifier when using get. We can also limit results by specifying the column family to choose.

e.g., get 'tx', 'row1', {COLUMN => ['cf1:col1']}

scan –> Scans the given table and displays all its rows.

e.g., scan 'tx'

disable –> Used to disable a table. A table cannot be altered while it is enabled: disable it first, then apply the change. A disabled table is not available for querying.

e.g., disable 'tx'

enable –> Used to enable a disabled table, making it available for querying again.

e.g., enable 'tx'

drop –> Deletes the given table.

e.g., drop 'tx'

HBase Walkthrough:

Using the commands above, we have created a table 'tx'. Now let us run some operations on it:

Inserting/updating rows in the table:

put 'tx', 'row1', 'cf1:col1', 'col1-'

put 'tx', 'row2', 'cf1:col1', 'col21'

In the command below we add a new column to the cf1 family at run time.

put 'tx', 'row2', 'cf1:col3', 'col31'

Retrieving data, entirely or column-specific, from the table:

get 'tx', 'row2'

get 'tx', 'row1', {COLUMN => ['cf1:col1']}

Schema reduction example:

Take the classic EMPLOYEE example: for ACID, storage, and performance reasons, RDBMS OLTP and OLAP schemas define EMPLOYEE, TEAM, and FINANCIAL as three separate tables, one per entity. HBase, on the other hand, can build a single table with the employee as the single entity and the other entities as its properties (column families). Any number of columns can be added on the fly to any of the column families. An RDBMS could also create one very large table, but it would then face the large-table problems: sparseness and a rigid schema.

create 'employee', {NAME => 'details', VERSIONS => 5}, {NAME => 'team', VERSIONS => 6}

put 'employee', 'emp1', 'details:name', 'emp_1'

put 'employee', 'emp1', 'details:id', 1

put 'employee', 'emp1', 'team:id', 10

put 'employee', 'emp1', 'team:name', 'coke'

put 'employee', 'emp1', 'team:name', 'pepsi'

get 'employee', 'emp1', {COLUMN => 'team', VERSIONS => 2}

disable 'employee'

alter 'employee', 'financial'

enable 'employee'

get 'employee', 'emp1', {COLUMN => 'financial', VERSIONS => 15}

…and that is HBase at 30,000 feet.

DB Session Management

Session management is very important for database administrators. Hosting a database is not always fun, as the people accessing it can do all kinds of things to it, and these sessions sometimes need to be killed for the greater good. Below is session management for a few databases.

For Oracle database session management:

Make sure you have admin credentials. Use the V$SESSION view for session management in Oracle.

select * from v$session;

or, specifically, to avoid unnecessarily complicated output:

select sid, serial#, username, machine, program from v$session;

Now, to kill the unwanted session: using the SID and serial# of the session, you can kill it as below;

alter system kill session 'SID,Serial#';

e.g., alter system kill session '13,10196';

For MS SQL Server 200x;

SQL Server has the simplest session monitoring there is. Connect to the DB server from SQL Server Management Studio, right-click on the server, and select "Activity Monitor". In the resulting window, expand the "Processes" pane to see the running sessions. Right-click on the pesky session and select "Kill Process".

I will add other Databases session management info soon….

Datawarehousing for NewBIes – Part 2

So now you have all the data in your warehouse, but you do not know how to connect data from one department to another. For example, the Sales team sold your products, but the income is handled by the Finance team; your inventory is managed by the Operations team, but the money required for the process is handled by Finance again. You have to find the common points that link your data from these different places.

To do this you can draw up an entity relationship diagram depicting the entities in your organization, entities being the various parts of the organization.

[Image: entity relationship diagram]

Here you work out what the IDs are in each department and whether those IDs are replicated in other departments for tracking. If they are, you are in luck; otherwise, get ready for some serious remodeling of the organization's operations.

These IDs, or unique identifiers for your entities, can be called dimensions. Dimensions are your concrete entities.

For example, in your Sales department your dimensions would be the customers who buy products, the products that are sold, the employees in the Sales department, and so on. The most layman way to decide whether an entity can be a dimension is to ask whether it has qualities of its own: customers can have contact information, a type, and so on. Anything that can hold its own existence in the warehouse can be a dimension. Orders placed by a customer, for example, cannot be dimensions: they are transactions and have no existence unless a dimension (the customer) creates one.
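The dimension-versus-transaction distinction can be sketched in a few lines of Python; every name and number below is made up purely for illustration:

```python
# Toy star schema: dimensions are concrete entities with their own
# attributes; the fact table holds transactions that only reference them.
customers = {1: {'name': 'Acme', 'type': 'wholesale'}}    # dimension
products = {10: {'name': 'Widget', 'category': 'tools'}}  # dimension

orders = [  # fact table: no existence of its own, just dimension keys + measures
    {'customer_id': 1, 'product_id': 10, 'qty': 3, 'amount': 30.0},
    {'customer_id': 1, 'product_id': 10, 'qty': 1, 'amount': 10.0},
]

def revenue_by_customer(orders, customers):
    """Join the facts back to the customer dimension to answer a business question."""
    totals = {}
    for o in orders:
        name = customers[o['customer_id']]['name']
        totals[name] = totals.get(name, 0.0) + o['amount']
    return totals

print(revenue_by_customer(orders, customers))  # {'Acme': 40.0}
```

Notice that an order is meaningless without its customer and product keys, while the customer row stands on its own: that is exactly the dimension test described above.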

Once you have your dimensions, you can correlate and consolidate dimensions from the various data sources and departments. You can, for instance, create a single hybrid Employees dimension out of the Employees dimensions of the different departments.

Now that you understand how your data resides in the warehouse, you can decide what your data warehouse boundaries should be and how your departments want to share their data. This leads to a very important topic: data marts… Dun Dun Daaan.

More on it in the next post. Same place and same channel……..

 

Datawarehousing for NewBIes – Part 1

Before you say "I see what you did there", this post is aimed at all newcomers to data warehousing and business intelligence technologies. Most of us learn the tools first and only then come to understand what data warehousing and business intelligence are in practice, and frankly it should be the other way around. This post is data warehousing for dummies, of which I was one.

First off, the definitions:

Data Warehouse: a typically huge database where you maintain your data. It can hold anything from historical data, like your sales from the past 10 years, to everyday operational data, like the number of hits on your website by different people. In layman's terms, it is a big database where you store all your organization's relevant data.

Data Warehousing: the practice of building efficient data warehouses so that their users can benefit by getting their questions answered.

Typical scenario: your client is a huge organization with many departments, such as Finance, Operations, Service, and Sales. The client has a board of members who make decisions and need structured data to make them. If you hand them the databases of the different departments, they will go insane; if you give them separate reports from each department, they will need extra time to correlate the data.

This is when you step up and say “I will build you a data warehouse and by the power of the data warehouse you will make decisions”.

Your journey begins as below:

You go to each department and figure out what on earth they do and where they store their data. After a lot of meetings and an unending supply of coffee, you have all the inputs you need.

You realize that you can take all the departments' data and simply dump it in a single place to start off. You use various data extraction and cleansing tools to get the data into one place.

Now you see that the data still has no connections. You start meeting with the departments again to understand how their data correlates with that of the other departments.

  (…………………………..To be continued)