Hbase 101 and Tutorial

Hbase is a hadoop eco-system component, Hbase is a column oriented database(NOSql Database) which uses in-memory processing to impart some quick-reads and writes capability to the Write Once Read Many Times rigidity of Hadoop. Hbase like any other columnar database uses a row identifier and column families.

The most basic gene of an RDBMS is a tuple. An RDBMS table is a collection of tuples. There is no identity below a Tuple, (cells on their own can’t be an entity, they are part of a tuple in RDBMS). Because of the aforementioned design principle an RDBMS table should have a fixed structure, updating it would mean updating all the tuples. Columnar Databases cleverly circumvent this by defining the basic gene as a cell. Each cell has a identity and membership to a row in a table, because of this freedom is granted to have different rows with different structures.

Advantages of Columnar databases;

1. Very good for Sparse Data scenarios

2. Reduction of Tables used to describe an Entity(Merging all the RDBMS tables related to one entity into one single table is possible)

3. Store Version history by default. Built with inherent SCD capability.

Limitations of Hbase;

1. Database information retrieval is not sql oriented. Uses get/put/scan type statements.

Below are some of the important commands to get started;

list –> Lists all the tables in the HBase DB

create –> Used to create tables, we need to specify at least one column family other wise table creation is not allowed.

e.g., create ‘tx’, ‘cf1’

describe –> Used to display Table stats.

e.g., describe ‘tx’

put –> Used to insert data into tables. Need to specify the row identifier so that the appropriate row is updated.

e.g., put ‘tx’,’row1′ ,’cf1:col1′,’col1′

put ‘tx’,’row1′ ,’cf1:col2′,’col2′

get –> Used for information retrieval from the table. We need to specify the row identifier when using get. We can also limit results by specifying the column family to choose.

e.g., get ‘tx’,’row1′,{COLUMN => [‘cf1:col1’]}

scan –> display specifications of the said table

e.g., scan ‘tx’

disable –> Used to disable the table. Altering a table isn’t allowed while it is enabled. Using disable it should be disabled and then updated. A table is not available for query when it is disabled.

e.g., disable ‘tx’

enable –> Used to enable a disabled table. Disabled tables aren’t available for querying.

e.g., enable ‘tx’

drop –> deleting the said table.

e.g., drop ‘tx’

Hadoop Walkthough;

Using the commands above, we have created a table ‘tx’, Now let us do some operations on them;

Inserting/Updating rows into the table;

put ‘tx’,’row1′ ,’cf1:col1′,’col1-‘

put ‘tx’,’row2′ ,’cf1:col1′,’col21′

In the below command line we are adding a new column to cf1 family during run time.

put ‘tx’,’row2′ ,’cf1:col3′,’col31′

Retrieving data entirely/column specific from the table;

get ‘tx’,’row2′

get ‘tx’,’row1′,{COLUMN => [‘cf1:col1’]}

Schema Reduction Example;

Using the classic example of an EMPLOYEE table, we can explore on how three tables EMPLOYEE, TEAM, Financial tables used in OLTP and RDBMS OLAP tables which define the three entities as three tables to attain ACID, storage, and performance reasons. Hbase on the other hand can build a single table with employee as the single entity and other entities as its properties(column families). We can add any no.of columns on the fly to any of the entities(column families). An RDBMS can also create a very large table but it would then be a large table problems sparseness, rigid schema.

create ’employee’, {NAME=>’details’,VERSIONS => 5},{NAME=>’team’,VERSIONS => 6}

put ’employee’,’emp1′,’details:name’,’emp_1′

put ’employee’,’emp1′,’details:id’,1

put ’employee’,’emp1′,’team:id’,10

put ’employee’,’emp1′,’team:name’,’coke’

put ’employee’,’emp1′,’team:name’,’pepsi’

get ’employee’,’emp1′,{COLUMN=>’team’,VERSIONS=>2}

disable ’employee’

alter ’employee’, ‘financial’

enable ’employee’

get ’employee’,’emp1′,{COLUMN=>’financial’,VERSIONS=>15}

that is Hbase at 30,000 feet…..

Resolving Infinite Hierarchy levels using Recursive Queries

Typically,

Most Reporting Tools do not support recursive querying. This is critical for scenarios where you need to traverse through a table multiple times during run time. One simple scenario is shown below;

Let us take Adam’s Family Tree example;

Any O.L.T.P application built to display this data will probably be storing it in the following way;

Why? To save space, improve Performance.

create table parent_child (parent varchar(max), child varchar(max))

insert into parent_child values(‘Adam’,’Abel’);
insert into parent_child values(‘Adam’,’Cain’);
insert into parent_child values(‘Adam’,’Seth’);
insert into parent_child values(‘Cain’,’Enoch’);
insert into parent_child values(‘Enoch’,’Irad’);
insert into parent_child values(‘Irad’,’Mehujael’);
insert into parent_child values(‘Seth’,’Enos’);
insert into parent_child values(‘Enos’,’Cainan’);
insert into parent_child values(‘Cainan’,’Mahaleel’);
insert into parent_child values(‘Mahaleel’,’Jared’);

select * from parent_child;

But the users may need a report showing the Ancestry details and the generation level which the Application may show using data structures, Arrays and run time coding. How, Can a reporting tool do this? You can create specialized tables but can you support infinite levels? Not all table entities will have the same number of levels leading to nulls in columns.

Solution: Databases support Recursive Queries. And Most of them follow standard ANSI syntax which means one code can run on multiple databases.

The Ancestry Details can now be found as;

WITH cte_name ( Ancestor, Descendent ,lvl )
AS
(
select PARENT, CHILD, 1 as lvl
from
parent_child
UNION ALL
select cte_name.Ancestor, parent_child.CHILD, lvl+1 from
parent_child inner join cte_name on cte_name.descendent=parent_child.PARENT
)
SELECT *
FROM cte_name order by Ancestor,lvl

This is only one type of scenario that has been solved. We can use the Recursive queries to solve other run time scenarios as well.

DB Session Management

Session Management is very important for Database Administrators. Hosting a database is not always fun as people accessing the database can do all kinds of stuff to it. These sessions need to be sometimes killed for the greater good. Below is session management for a few databases.

For Oracle Database Session Management;

Make sure you are have admin credentials. Use V$Session View for Session Management for Oracle Database.

select * from v$session;

or specifically

select sid,serial#,username,machine,program from v$session; to avoid unnecessary complicated data.

Now to kill the unnecessary session. Using the SID and Serial # of the session you can kill the session as below;

alter system kill session ‘SID,Serial #’;

e.g., alter system kill session ‘13,10196’;

For MS SQL Server 200x;

SQL Server has the simplest Session monitoring there is. Connect to the DB server from SQL Server Management Studio. Right Click on the Server and select “Activity Monitor”. In the resulting window expand the “Processes” pane to get the sessions running. Right click on the desired pesky session and select “Kill Process”.

I will add other Databases session management info soon….