Tuesday, January 31, 2012

Hadoop, HBase in Amazon Cloud (AWS)

Hadoop and HBase can be set up in the Amazon cloud to take advantage of distributed computing: new instances can be started quickly at any time as requirements or application load change.


Elastic Compute Cloud (EC2) : Provides easily accessible, configurable instances so that an application can be scaled as needed.
Elastic MapReduce (EMR) : Provides a Hadoop environment for running MapReduce jobs on top of EC2, with S3 for data storage.
Simple Storage Service (S3) : A persistent, highly available data storage service in the Amazon cloud.


Of the many running instances launched from an Amazon Machine Image (AMI), one acts as the Hadoop NameNode and the rest act as DataNodes. The NameNode (master node) also runs the HBase HMaster, and the cluster start-up scripts use SSH with a public/private key pair to reach the slave nodes. The JobTracker asks the NameNode where the data is located and then tells the TaskTrackers to do the actual processing on the DataNodes (slave nodes) that hold it. HBase stores its files in HDFS, which keeps each block on several DataNodes (as many as the configured replication value) for availability and faster access. When a table's size or hit rate grows rapidly, HBase splits the affected region in two so that no single node becomes a bottleneck.
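The splitting behaviour described above can be illustrated with a small toy model. This is only a sketch and not HBase code: the class RegionSplitSketch, the Region type and the row-count threshold MAX_ROWS_PER_REGION are all made up for illustration. Real HBase decides when to split based on how large a region has grown on disk, not on a simple row count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of splitting a sorted key range in two. NOT the HBase implementation.
public class RegionSplitSketch {
    // Hypothetical threshold; real HBase looks at on-disk size, not row count.
    static final int MAX_ROWS_PER_REGION = 4;

    // A region holds a contiguous, sorted slice of row keys.
    static class Region {
        final TreeMap<String, String> rows = new TreeMap<>();
    }

    // Regions are kept in row-key order.
    final List<Region> regions = new ArrayList<>();

    RegionSplitSketch() {
        regions.add(new Region());
    }

    void put(String rowKey, String value) {
        Region target = findRegion(rowKey);
        target.rows.put(rowKey, value);
        if (target.rows.size() > MAX_ROWS_PER_REGION) {
            split(target);
        }
    }

    // Pick the last region whose first key is <= rowKey (or the first region).
    Region findRegion(String rowKey) {
        Region target = regions.get(0);
        for (Region r : regions) {
            if (!r.rows.isEmpty() && r.rows.firstKey().compareTo(rowKey) <= 0) {
                target = r;
            }
        }
        return target;
    }

    // Move the upper half of the keys into a new region placed right after this one.
    void split(Region region) {
        List<String> keys = new ArrayList<>(region.rows.keySet());
        String midKey = keys.get(keys.size() / 2);
        Region upper = new Region();
        upper.rows.putAll(region.rows.tailMap(midKey, true));
        region.rows.tailMap(midKey, true).clear();
        regions.add(regions.indexOf(region) + 1, upper);
    }

    public static void main(String[] args) {
        RegionSplitSketch table = new RegionSplitSketch();
        for (int i = 1; i <= 5; i++) {
            table.put(String.format("row-%02d", i), "value-" + i);
        }
        // The fifth put exceeds the threshold, so the single region splits in two.
        for (Region r : table.regions) {
            System.out.println(r.rows.firstKey() + " .. " + r.rows.lastKey());
        }
    }
}
```

Because each half covers its own key range, reads and writes for the two ranges can then be served by different nodes.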


A locally configured Hadoop and HBase installation can be converted to an Amazon AMI and deployed directly to the cloud as any number of identical instances.

Monday, January 23, 2012

How to : working with HBase Filter API

  • org.apache.hadoop.hbase.filter
        HBase provides filters for searching by row key, column family, column, and column value. They range from binary comparators, column value filters, and prefix filters to regex and timestamp filters.
A complete list is available on the HBase Filter API documentation page.


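You can create a FilterList that contains any number of filters, including other FilterLists as children. A FilterList combines its filters using either FilterList.Operator.MUST_PASS_ONE (a row passes if at least one filter accepts it) or FilterList.Operator.MUST_PASS_ALL (a row passes only if every filter accepts it).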
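To make the two operators and the nesting concrete, here is a small stand-alone sketch in plain Java. It does not use the HBase classes: RowFilterList, its Operator enum and the row-key predicates are hypothetical stand-ins that only mimic how MUST_PASS_ONE and MUST_PASS_ALL combine child filters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy stand-in for FilterList; NOT the org.apache.hadoop.hbase.filter API.
public class FilterListSketch {
    enum Operator { MUST_PASS_ALL, MUST_PASS_ONE }

    // A "filter" here is simply a predicate over a row key.
    static class RowFilterList implements Predicate<String> {
        private final Operator operator;
        private final List<Predicate<String>> filters = new ArrayList<>();

        RowFilterList(Operator operator) {
            this.operator = operator;
        }

        // A child can be a single filter or another RowFilterList.
        RowFilterList addFilter(Predicate<String> filter) {
            filters.add(filter);
            return this;
        }

        @Override
        public boolean test(String rowKey) {
            if (operator == Operator.MUST_PASS_ALL) {
                return filters.stream().allMatch(f -> f.test(rowKey));
            }
            return filters.stream().anyMatch(f -> f.test(rowKey));
        }
    }

    // (key starts with "user-" OR "admin-") AND key ends with "-2012"
    static RowFilterList buildExample() {
        RowFilterList prefixes = new RowFilterList(Operator.MUST_PASS_ONE)
                .addFilter(key -> key.startsWith("user-"))
                .addFilter(key -> key.startsWith("admin-"));
        return new RowFilterList(Operator.MUST_PASS_ALL)
                .addFilter(prefixes)
                .addFilter(key -> key.endsWith("-2012"));
    }

    public static void main(String[] args) {
        RowFilterList filter = buildExample();
        System.out.println(filter.test("user-42-2012"));  // true
        System.out.println(filter.test("admin-7-2011"));  // false: wrong suffix
        System.out.println(filter.test("guest-1-2012"));  // false: no matching prefix
    }
}
```

With the real API the shape is the same: an inner FilterList built with MUST_PASS_ONE is added as a child of an outer FilterList built with MUST_PASS_ALL.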


Once the Filter or FilterList is created (e.g. yourFilterList.addFilter(yourFilter); ), add it to the Scan object (e.g. scan.setFilter(yourFilterList); ). That's it. At this point you are ready to call the getScanner method on your HTable instance. It returns an iterable ResultScanner that yields the rows matching your filter criteria.