Hbase is a hadoop eco-system component, Hbase is a column oriented database(NOSql Database) which uses in-memory processing to impart some quick-reads and writes capability to the Write Once Read Many Times rigidity of Hadoop. Hbase like any other columnar database uses a row identifier and column families.
The most basic gene of an RDBMS is a tuple. An RDBMS table is a collection of tuples. There is no identity below a Tuple, (cells on their own can’t be an entity, they are part of a tuple in RDBMS). Because of the aforementioned design principle an RDBMS table should have a fixed structure, updating it would mean updating all the tuples. Columnar Databases cleverly circumvent this by defining the basic gene as a cell. Each cell has a identity and membership to a row in a table, because of this freedom is granted to have different rows with different structures.
Advantages of Columnar databases;
1. Very good for Sparse Data scenarios
2. Reduction of Tables used to describe an Entity(Merging all the RDBMS tables related to one entity into one single table is possible)
3. Store Version history by default. Built with inherent SCD capability.
Limitations of Hbase;
1. Database information retrieval is not sql oriented. Uses get/put/scan type statements.
Below are some of the important commands to get started;
list –> Lists all the tables in the HBase DB
create –> Used to create tables, we need to specify at least one column family other wise table creation is not allowed.
e.g., create ‘tx’, ‘cf1’
describe –> Used to display Table stats.
e.g., describe ‘tx’
put –> Used to insert data into tables. Need to specify the row identifier so that the appropriate row is updated.
e.g., put ‘tx’,’row1′ ,’cf1:col1′,’col1′
put ‘tx’,’row1′ ,’cf1:col2′,’col2′
get –> Used for information retrieval from the table. We need to specify the row identifier when using get. We can also limit results by specifying the column family to choose.
e.g., get ‘tx’,’row1′,{COLUMN => [‘cf1:col1’]}
scan –> display specifications of the said table
e.g., scan ‘tx’
disable –> Used to disable the table. Altering a table isn’t allowed while it is enabled. Using disable it should be disabled and then updated. A table is not available for query when it is disabled.
e.g., disable ‘tx’
enable –> Used to enable a disabled table. Disabled tables aren’t available for querying.
e.g., enable ‘tx’
drop –> deleting the said table.
e.g., drop ‘tx’
Hadoop Walkthough;
Using the commands above, we have created a table ‘tx’, Now let us do some operations on them;
Inserting/Updating rows into the table;
put ‘tx’,’row1′ ,’cf1:col1′,’col1-‘
put ‘tx’,’row2′ ,’cf1:col1′,’col21′
In the below command line we are adding a new column to cf1 family during run time.
put ‘tx’,’row2′ ,’cf1:col3′,’col31′
Retrieving data entirely/column specific from the table;
get ‘tx’,’row2′
get ‘tx’,’row1′,{COLUMN => [‘cf1:col1’]}
Schema Reduction Example;
Using the classic example of an EMPLOYEE table, we can explore on how three tables EMPLOYEE, TEAM, Financial tables used in OLTP and RDBMS OLAP tables which define the three entities as three tables to attain ACID, storage, and performance reasons. Hbase on the other hand can build a single table with employee as the single entity and other entities as its properties(column families). We can add any no.of columns on the fly to any of the entities(column families). An RDBMS can also create a very large table but it would then be a large table problems sparseness, rigid schema.
create ’employee’, {NAME=>’details’,VERSIONS => 5},{NAME=>’team’,VERSIONS => 6}
put ’employee’,’emp1′,’details:name’,’emp_1′
put ’employee’,’emp1′,’details:id’,1
put ’employee’,’emp1′,’team:id’,10
put ’employee’,’emp1′,’team:name’,’coke’
put ’employee’,’emp1′,’team:name’,’pepsi’
get ’employee’,’emp1′,{COLUMN=>’team’,VERSIONS=>2}
disable ’employee’
alter ’employee’, ‘financial’
enable ’employee’
get ’employee’,’emp1′,{COLUMN=>’financial’,VERSIONS=>15}
that is Hbase at 30,000 feet…..