Friday, August 31, 2012

Extending Hive : Custom Mapper & Reducer

Hive : A data warehouse system on top of Hadoop for ad-hoc queries.

  • Queries get converted into map and reduce tasks that run in parallel across a number of nodes, bringing results back quickly from any kind of massive data set. 
  • Hive also allows us to extend it and write our own implementations of the map and reduce steps. These can be written as scripts or classes and then used as Hive functions.
  • Extending the Mapper : By default Hive supports a few input formats where every line is separated by '\n' and fields are separated by ',' or '|', etc. When the data is not in a format Hive can understand, we can write the parsing logic in a custom mapper-side class and emit key-value pairs. For this purpose Hive lets us extend the GenericUDTF (user-defined table-generating function) class. 
  • Extending the Reducer : In the same way, we can derive further results in a custom reducer by extending GenericUDAF (generic user-defined aggregate function). Results can be filtered and written to multiple or different tables.
These are very useful when you need to analyse or query a data set stored in a very different or unstructured format. Extending the mapper helps bring that data into a common key-value form, and extending the reducer can again split or produce different sets of data from the map task results. A minimal UDTF sketch is shown below.
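For illustration, here is a minimal sketch (not from the original post) of a custom UDTF that parses a raw line into two string columns. The class name, field names and separator are made up; the API shown is the Hive 0.x-era GenericUDTF contract (initialize/process/forward/close).

// Hypothetical UDTF sketch: splits a raw line into (key, value) string columns.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class ParseLineUDTF extends GenericUDTF {

    @Override
    public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        // Declare the output schema: two string columns per emitted row.
        List<String> fieldNames = new ArrayList<String>();
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("key");
        fieldNames.add("value");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public void process(Object[] record) throws HiveException {
        // Assumes a single string argument; parse the raw line and emit key/value rows.
        String line = record[0].toString();
        String[] parts = line.split("\\|", 2);   // assume '|' separated fields
        forward(new Object[] { parts[0], parts.length > 1 ? parts[1] : null });
    }

    @Override
    public void close() throws HiveException {
        // nothing to clean up in this sketch
    }
}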

Thank You

What is Apache Sqoop?

Apache Sqoop is a tool for bulk importing/exporting data between relational stores and the Hadoop ecosystem (HDFS, HBase or Hive).

  • works with a number of databases and commercial data warehouses.
  • available as a command-line tool; it can also be used from Java by passing the appropriate arguments.
  • graduated from the Incubator and became a top-level project in the ASF.
The current version of Sqoop runs a map-only job, with all the transformation happening in the map tasks. (Sqoop 2 will possibly have a reduce task as well.)
Fig : Sqoop 2 Architecture diagram taken from Cloudera.com
Example : 
alok@ubuntu:~/apache/sqoop-1.4.2$ bin/sqoop import --connect jdbc:mysql://<hostname>:3306/<dbname> --username <user> -P --driver com.mysql.jdbc.Driver --table <tablename> --hbase-table <hbase-tablename> --column-family <hbase-columnFamily> --hbase-create-table

or use it like this in your Java programs:
// Requires the Sqoop jar (e.g. sqoop-1.4.2) and its dependencies on the classpath.
import java.util.ArrayList;
import org.apache.sqoop.Sqoop;

ArrayList<String> args = new ArrayList<String>();
// Depending on the Sqoop version, runTool() may expect the tool name as the first argument:
args.add("import");
args.add("--connect");
args.add("jdbc:mysql://<hostname>:3306/<dbname>");
args.add("--username");
args.add("<user>");
args.add("--driver");
args.add("com.mysql.jdbc.Driver");
args.add("--table");
args.add("<tablename>");
args.add("--hbase-table");
args.add("<hbase-tablename>");
args.add("--column-family");
args.add("<hbase-colFamilyName>");
args.add("--hbase-create-table");
args.add("--num-mappers");
args.add("2");

int ret = Sqoop.runTool(args.toArray(new String[args.size()]));
  • Sqoop can write data directly to HDFS, HBase or Hive.
  • It can also export data back to RDBMS tables from Hadoop (see the export sketch after this list).
  • Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks.
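As a rough illustration of the export direction (not from the original post), the same Sqoop.runTool() API can be driven with the 'export' tool; the connection string, table and HDFS directory below are placeholders.

// Hypothetical export counterpart of the import example above.
String[] exportArgs = new String[] {
    "export",                        // tool name; may be implicit depending on the Sqoop version
    "--connect", "jdbc:mysql://<hostname>:3306/<dbname>",
    "--username", "<user>",
    "--password", "<password>",
    "--table", "<tablename>",
    "--export-dir", "/user/<hdfs-dir-to-export>"   // HDFS directory holding the data to export
};
int exitCode = Sqoop.runTool(exportArgs);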


Monday, July 23, 2012

How to : working with HBase Coprocessor

HBase Coprocessor : It allows user code to be executed at each region of a table, inside the region server; clients only receive the final responses from every region. HBase provides AggregateProtocol to support common aggregation (sum, avg, min, max, std) functionality.

The coprocessor framework is divided into two parts: Endpoints and Observers.
Endpoint : It allows you to write your own pluggable class, which extends BaseEndpointCoprocessor and can have any number of methods that you want executed at the table's region servers. A method executes much faster at the regionserver and minimizes network load, since only the results are transmitted back to the client. The client needs to do the final reduction on the results returned by each region server.

Example : The snippet below illustrates only the client-side call to the HBase coprocessor. A separate 'GroupByAggregationProtocol' interface extending 'CoprocessorProtocol' with the required methods, and an implementing class that implements 'GroupByAggregationProtocol' and extends 'BaseEndpointCoprocessor', must be created and deployed on each regionserver.
// 'filterList' and 'scan' are assumed to be a FilterList and Scan prepared by the client.
Map<byte[], Map<String, List<Long>>> resultFromCoprocessor = table.coprocessorExec(
        GroupByAggregationProtocol.class,
        <start-RowKey>,   // byte array, or null to start from the first region
        <end-Rowkey>,     // byte array, or null to go up to the last region
        new Batch.Call<GroupByAggregationProtocol, Map<String, List<Long>>>() {
            @Override
            public Map<String, List<Long>> call(GroupByAggregationProtocol aggregation) throws IOException {
                // executed inside each region server
                return aggregation.getGroupBySum(filterList, scan);
            }
        });

// One entry per region; the client does the final merge/reduction.
for (Map.Entry<byte[], Map<String, List<Long>>> entry : resultFromCoprocessor.entrySet()) {
    Map<String, List<Long>> resultsFromRegion = entry.getValue();
    // iterate through results from each regionserver ...
}
Endpoint coprocessors can be thought of as stored procedures in an RDBMS.
Observers : They provide hooks to override HBase's default behaviour when an event occurs.
Observers come at three levels:
a) RegionObserver : handles/overrides Get, Put, Delete, Scan and so on. Hooks come in pre and post variants (e.g. preGet, postDelete).
b) MasterObserver : handles table creation, deletion and alter events, e.g. preCreateTable or postCreateTable.
c) WALObserver : handles write-ahead log events, e.g. preWALWrite or postWALWrite.

Observer coprocessors can be thought of as triggers in an RDBMS. A small RegionObserver sketch follows.
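As an illustration (not from the original post), here is a minimal RegionObserver written against the 0.92/0.94-era API; method signatures differ in later HBase versions, and the class name and row-key prefix are made up.

// Hypothetical observer: runs before every Put on the table it is attached to.
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class AuditRegionObserver extends BaseRegionObserver {

    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, boolean writeToWAL) throws IOException {
        // Example hook: reject writes to a reserved row-key prefix.
        if (Bytes.toString(put.getRow()).startsWith("reserved_")) {
            throw new IOException("Writes to 'reserved_*' rows are not allowed");
        }
    }
}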

Wednesday, June 27, 2012

What is IaaS, Paas & SaaS?

These are the three main layers of the cloud computing stack: Infrastructure as a Service, Platform as a Service and Software as a Service.
  • SaaS applications are designed for end-users, accessible over the web.
  • PaaS is a set of tools and services that help developers design, develop, build and deploy applications quickly.
  • IaaS serves the need for storage, hardware, servers and networking components.
These cloud computing stacks can provide : 
  • any Service on-demand.
  • any Platform on-demand.
  • large Infrastructure on-demand.
The elasticity of cloud computing brings scalability and accessibility to applications. 

Examples : Amazon AWS (EC2), Google Cloud (Gmail), Microsoft Azure(Sky Drive) etc.

Tuesday, June 26, 2012

Java RMI : Remote Method Invocation

Remote Method Invocation (RMI) : It allows an object running in one Java virtual machine (say a client machine)  to invoke methods on an object running in another Java virtual machine (a Server machine).

  • The server accepts tasks from clients, runs them, and returns any results. The server code consists of an interface and a class; the interface defines the methods that can be invoked from the client. 
  • The RMI interface extends java.rmi.Remote, and each of its methods declares java.rmi.RemoteException in its throws clause. 
  • The server registers its remote objects with RMI's simple naming facility, the RMI registry.
  • The client program obtains a stub for the registry on the server's host, looks up the remote object's stub by name in the registry, and then invokes methods on the remote object through the stub.
  • Serializable objects can be passed back and forth between client and server.
  • Source files can be compiled like :
    javac -d destDir RMIInterface.java RMIInterfaceImpl.java Client.java
RMI stub : In simple terms, it's a proxy or surrogate that manages the invocation of methods on remote objects. A minimal end-to-end sketch follows.
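For illustration (not from the original post), here is a minimal sketch of the pieces described above: the remote interface, a server that exports and registers it, and a client that looks it up. The interface and class names are made up.

// Hypothetical RMI example: a remote calculator.
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// The interface defines the methods a client may invoke remotely.
interface Calculator extends Remote {
    int add(int a, int b) throws RemoteException;
}

// Server-side implementation, exported and bound in the RMI registry.
class CalculatorImpl implements Calculator {
    public int add(int a, int b) { return a + b; }

    public static void main(String[] args) throws Exception {
        Calculator stub = (Calculator) UnicastRemoteObject.exportObject(new CalculatorImpl(), 0);
        Registry registry = LocateRegistry.createRegistry(1099);   // or getRegistry() if one is already running
        registry.rebind("Calculator", stub);
    }
}

// Client: obtain the registry stub, look up the remote object, invoke a method.
class CalculatorClient {
    public static void main(String[] args) throws Exception {
        Registry registry = LocateRegistry.getRegistry("<server-host>", 1099);
        Calculator calc = (Calculator) registry.lookup("Calculator");
        System.out.println("2 + 3 = " + calc.add(2, 3));
    }
}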

Monday, June 18, 2012

Agile Scrum Methodology

What is Agile : Able to move quickly or easily. 
In the software industry or product development, it means the team divides its work into iterations and starts working quickly once the design is ready (most important: spend as much time as you can to make the design better).
The Role Of Scrum
Scrum has three fundamental roles: Product Owner, Scrum Master, and Team Member.
  • The Product Owner is responsible for communicating the vision of the product and maintains a prioritized wish list called the product backlog.
  • The Scrum Master acts as a liaison between the Product Owner and the team and keeps the team focused on its goal. He or she meets the team each day in the Daily Scrum to assess its progress.
  • Team Members are responsible for determining how tasks will be accomplished. They can select any work of their choice, which they then commit to finish.
More Scrum Terminologies 
  1. Sprint Planning
  2. Sprint Goal
  3. Sprint Backlog
  4. Sprint Burndown
  5. Sprint Review
Agile Scrum Benefits
  1. Shorter Delivery Cycles
  2. Customer Involvement via feedback 
  3. Self Organizing Team
  4. Adaptable to Change
The Scrum cycle repeats at the end of every sprint with a new goal and newly prioritized tasks.


Thursday, May 31, 2012

Things to remember : In Map Reduce


Q 1. What is IdentityMapper?
A - An empty Mapper which directly writes key/value to the output.
         Mapper<K,V,K,V>
Q 2. What is InverseMapper?
A - A Mapper which swaps the <Key,Value> to <Value,Key>.
         Mapper<K,V,V,K>
Q 3. What is IdentityReducer?
A - It performs no reduction, directly writes key/value to the output.
         Reducer<K,V,K,V>
Q 4. What is Partitioner?
A - It decides which reducer each map output record is sent to; the partitioning happens on the map side as output is emitted. A custom Partitioner can be implemented to control which key/value pairs go to which reducer.
In the Map-Reduce model, a unique key 'K' with all of its Iterable<V> must go to the same reducer. A small sketch follows below.
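For illustration (not from the original post), here is a minimal custom Partitioner against the old org.apache.hadoop.mapred API used elsewhere in this post; the class name and partitioning rule are made up.

// Hypothetical partitioner: route keys by their first character.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstCharPartitioner implements Partitioner<Text, IntWritable> {

    @Override
    public void configure(JobConf job) {
        // no configuration needed for this sketch
    }

    // Keys sharing a first character land on the same reducer (keys assumed non-empty).
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;
        }
        return (key.toString().charAt(0) & Integer.MAX_VALUE) % numReduceTasks;
    }
}
// Registered via: jobConf.setPartitionerClass(FirstCharPartitioner.class);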
Q 5. What are the uses of Combiner?
A - It helps in performing local aggregation on the map output to reduce the amount of data sent to any reducer.
Q 6. Where Map outputs are stored?
A - Intermediate (sorted and grouped) map output is written to the local disk of the node running the map task, not to HDFS, and it can optionally be compressed (e.g. gzipped).
Q 7. How to set number of mapper & reducer?
A - A JobConf object is used to set the number of mappers and reducers.
JobConf lives in the package org.apache.hadoop.mapred and extends org.apache.hadoop.conf.Configuration.
public void setNumMapTasks(int n);    // sets the number of map tasks (a hint to the framework)
public void setNumReduceTasks(int n); // sets the number of reduce tasks
Q 8. What is ChainMapper?
A - It allows multiple Mapper classes to be used within a single map task.
The output of one mapper is passed to the next mapper, and so on.
Each Mapper is executed in the chain.
Q 9. What is RegexMapper?
A - A Mapper that extracts text matching a regular expression.

Wednesday, May 23, 2012

Things to remember : In Core JAVA

Q 1. Can you tell, which Algorithm is used by HashMap/HashTable?
A - HashMap internally uses buckets to store key-value pairs. When a key is passed to HashMap, it is not used as the key as-is: a hash is derived from its hashCode(), and that hash selects a bucket. When the same bucket is selected for multiple keys (i.e. a collision), the new key-value pair is stored in the same bucket as the next item (each bucket is a linked list and can contain multiple key-value pairs).
HashMap can take an 'initial capacity' and a 'load factor' in its constructor. 
initial capacity : the number of buckets created at initialization time. 
load factor : the fill ratio at which the number of buckets is increased (the map is resized).
Hashtable is a synchronized version of HashMap, but HashMap gives a performance bonus when the object is not accessed by multiple threads. A small sketch of the constructor parameters follows. 
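A tiny sketch of those constructor parameters (the values here are illustrative, not prescriptive):

import java.util.HashMap;
import java.util.Map;

// 16    -> initial capacity: number of buckets created up front
// 0.75f -> load factor: once size exceeds capacity * load factor, the bucket
//          array is resized (roughly doubled) and the entries are rehashed
Map<String, Integer> counts = new HashMap<String, Integer>(16, 0.75f);
counts.put("a", 1);
counts.put("b", 2);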
Q 2. Name some ways of Inter-Process Communication (IPC)?
A - These are :
  1. Socket
  2. Message Queue
  3. Pipe
  4. Signal
  5. File
  6. Remote Method Invocation (RMI)
  7. Shared Memory
  8. SOAP, REST, Thrift, XML, JSON
Q 3. What is Mutual Exclusion?
A - Mutual Exclusion in OS (Mutex) is a collection of techniques/algorithms for sharing resources so that concurrent uses do not conflict and cause unwanted interactions. One of the most commonly used techniques for mutual exclusion is the semaphore.
Q 4. What are Abstraction and Encapsulation?
A - Abstraction : hiding away the unimportant details of an object; it focuses on the outside view.
      Encapsulation : the process of wrapping the data members and member functions together into a single unit.



Monday, April 30, 2012

What is : Hadoop Sequence File?

Hadoop Sequence File : These are flat files consisting of binary key-value pairs; any key-value pair can be stored as byte arrays.
Three types : 
Uncompressed
Record-Compressed
Block-Compressed

Need for Sequence Files : Hadoop is meant for processing big data and uses a default block size of 64 MB on a cluster. A file much smaller than the block size does not pad the disk out to 64 MB, but every such small file still costs a full block's worth of NameNode metadata and typically its own map task.
      In practice, applications often deal with files of only a few KB. So it is advantageous to pack a number of such small files into a SequenceFile as key-value pairs, which lets programmers run the same logic on every packed file with Map-Reduce jobs.
    The org.apache.hadoop.io.SequenceFile class provides read and write methods, and it also provides for sorting of SequenceFile keys. A short sketch follows.
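For illustration (not from the original post), a short sketch against the Hadoop 1.x-era SequenceFile API; the path and key/value types are made up.

// Hypothetical write-then-read of a block-compressed SequenceFile.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/<name>/smallfiles.seq");

        // Write many small payloads as key/value records into one block-compressed file.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, Text.class, IntWritable.class, CompressionType.BLOCK);
        writer.append(new Text("file-001"), new IntWritable(42));
        writer.close();

        // Read the records back.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        Text key = new Text();
        IntWritable value = new IntWritable();
        while (reader.next(key, value)) {
            System.out.println(key + " => " + value);
        }
        reader.close();
    }
}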
Thank YOU

How to : choose between DOM, SAX or XMLStreamWriter

What is XML : I call it a "language for the Internet"; it helps applications communicate seamlessly while staying in a human-readable format. An XML file typically contains 'Elements' and 'Attributes', which are kinds of XML 'Node'. 
(Element, CData, Comment, Attribute, Entity and Text are a few examples of node types.)
There are a number of APIs available for working with XML, and each has its positive and negative aspects.


W3C DOM : 
good for - random reads and writes of XML nodes.
drawbacks - large memory footprint, slower performance on big documents.
SAX :
good for - faster read access; it's lightweight.
not suitable for - writing/creating XML nodes.
XMLStreamWriter :
good for - streaming out XML while building it; useful in web services handling larger files.
not suitable for - random reads or writes; it's a sequential, one-way, cursor-like implementation. 


All you need to do is choose the API closest to your need. XMLStreamWriter is a good all-purpose choice and most effective for mobile devices. A short sketch follows.
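For illustration (not from the original post), a short StAX writing sketch showing the one-way, cursor-style output described above; the element and attribute names are made up.

// Hypothetical XMLStreamWriter usage.
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class StaxWriteDemo {
    public static void main(String[] args) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);

        xml.writeStartDocument("UTF-8", "1.0");
        xml.writeStartElement("orders");
        xml.writeStartElement("order");
        xml.writeAttribute("id", "101");
        xml.writeCharacters("sample payload");
        xml.writeEndElement();     // </order>
        xml.writeEndElement();     // </orders>
        xml.writeEndDocument();
        xml.flush();
        xml.close();

        System.out.println(out.toString());
    }
}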

Thank YOU

Friday, March 30, 2012

What is SEG_Y? Headers and Traces.

SEG_Y is an open standard file format for storing geophysical (e.g. seismic) data. Files are traditionally stored on magnetic tapes and are usually several gigabytes in size.

  • Headers :
the file may start with an optional SEG_Y tape label.
the next 3200 bytes contain the EBCDIC textual header.
the next 400 bytes contain the binary header.


  • Traces :
Each trace consists of a trace header and trace data.
The first 240 bytes contain the trace header.
The trace data follows; its length depends on the number of samples per trace (4004 bytes in the data set discussed here). A small header-reading sketch follows.
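As a rough illustration (not from the original post) of stepping over the layout described above, the sketch below reads the fixed-size sections of a SEG_Y file. It assumes no optional tape label is present; the file name is a placeholder and the EBCDIC code page ("Cp1047") may vary by JVM.

// Hypothetical SEG_Y header reader.
import java.io.DataInputStream;
import java.io.FileInputStream;

public class SegYHeaderDemo {
    public static void main(String[] args) throws Exception {
        DataInputStream in = new DataInputStream(new FileInputStream("<file>.segy"));

        byte[] textualHeader = new byte[3200];   // EBCDIC textual header
        in.readFully(textualHeader);
        String text = new String(textualHeader, "Cp1047");   // IBM EBCDIC code page (assumption)

        byte[] binaryHeader = new byte[400];     // binary file header
        in.readFully(binaryHeader);

        byte[] traceHeader = new byte[240];      // first trace header
        in.readFully(traceHeader);
        // ... trace data of the size used by this data set follows ...

        in.close();
        System.out.println(text.substring(0, 80));   // first "card" of the textual header
    }
}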

Tuesday, February 21, 2012

SharePoint 2010 Products configuration wizard Errors & Fix


1. Exception – Failed to create the configuration database. An exception of type System.Security.Cryptography.CryptographicException was thrown. Additional exception information: The data is invalid.
Resolution – This has two steps
Step 1: Make sure that the “Network Service” account has full access to the “14” directory under %CommonProgramFiles%\Microsoft Shared\Web Server Extensions.
Step 2: Delete the registry key located under “SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\14.0\Secure\FarmAdmin” Registry key and then run the SharePoint 2010 Products Configuration Wizard.
It is likely that this registry key needs to be cleared each time you run the wizard after an unsuccessful attempt :)


2. Exception - Failed to register SharePoint Services. An exception of type System.Runtime.InteropServices.COMException was thrown. Additional exception information: Could not access the Search service configuration database.
I followed these steps and the configuration finished successfully.
1. On the Start menu, click Run. In the Open box, type regedit and then click OK.
2. In the Registry Editor, navigate to the following subkey, and then delete it:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\WSS\Services\Microsoft.SharePoint.Search.Administration.SPSearchService3.
3. Run the SharePoint Products and Technologies Configuration Wizard again.


Monday, February 20, 2012

How to : working with HBase Delete API

  • org.apache.hadoop.hbase.client.Delete
     HBase provides Delete to perform deletes on a column (or columns), a column family (or families), or an entire row, once a Delete object is instantiated with a row key. 
     Delete accepts a long timestamp as a parameter along with a column family and qualifier, which deletes all versions having smaller timestamps. Delete creates a tombstone for any column or version being deleted; HBase does the physical deletion later, when it runs a major compaction. 
IMPORTANT : If you 'put' data with the same timestamp as data that has recently been deleted, you will not see it until HBase does its compaction. The 'put' will not raise any error or exception, but at the same time 'scan' or 'get' will not return the value until compaction happens. 
   If you don't provide a timestamp, the default is the current system time in milliseconds. 
Update is currently not supported on HBase tables; a 'Delete' followed by a 'put' is required to achieve it. If the update is on a column with multiple versions, the timestamp plays a critical role in maintaining the version order. Design your HBase schema accordingly :)

     To delete multiple rows (bulk delete), use the
public void delete(List<Delete> deletes)
            throws IOException
method on the HTable class. A short sketch follows.
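For illustration (not from the original post), a small sketch of these Delete calls against the 0.92/0.94-era client API; the table, row, family and qualifier names are made up.

// Hypothetical single and bulk deletes.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "mytable");

// Delete all versions of one column with timestamps up to the given timestamp.
Delete d = new Delete(Bytes.toBytes("row-1"));
d.deleteColumns(Bytes.toBytes("cf"), Bytes.toBytes("qual"), System.currentTimeMillis());
table.delete(d);

// Bulk delete of entire rows.
List<Delete> deletes = new ArrayList<Delete>();
deletes.add(new Delete(Bytes.toBytes("row-2")));
deletes.add(new Delete(Bytes.toBytes("row-3")));
table.delete(deletes);

table.close();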

Tuesday, January 31, 2012

Hadoop, HBase in Amazon Cloud (AWS)

Hadoop and HBase can be configured in the Amazon cloud to take advantage of distributed computing, where one can quickly start new instances as the requirements (or load) of an application change.


Elastic Compute Cloud (EC2) : It provides easily accessible, configurable instances needed to scale an application.
Elastic Map Reduce (EMR) : It provides support for Hadoop to run Map-Reduce jobs on top of EC2 and S3 data storage.
Simple Storage Service (S3) : A persistent data storage service in Amazon Cloud with high availability.


Out of the many running Amazon Machine Images (AMIs), one acts as the NameNode and the rest act as DataNodes for Hadoop. The NameNode/master node runs the HBase HMaster, which uses 'ssh' with public-private keys to communicate with the slave nodes. The JobTracker talks to the NameNode to get the location of the data and tells the TaskTrackers to perform the actual processing of the data on the various DataNodes/slave nodes. HBase keeps the same data on several DataNodes (equal to the replication value defined in the configuration) for faster access. When a table's size or traffic increases rapidly, HBase splits it in two to handle the bottleneck on any single DataNode. 


A locally configured Hadoop/HBase setup can be converted into an Amazon AMI and deployed to the cloud directly, with any number of identical instances. 

Monday, January 23, 2012

How to : working with HBase Filter API

  • org.apache.hadoop.hbase.filter
        HBase provides Filters to perform searches on row keys, column families, columns and column values. These range from binary comparators, column-value and prefix filters to regex and timestamp filters, etc.
A complete list is available at HBase Filter API documentation page.


You can create a FilterList, which may itself contain a number of Filters (or nested FilterLists) as children. The Filters in a FilterList are evaluated with either FilterList.Operator.MUST_PASS_ONE or FilterList.Operator.MUST_PASS_ALL.


Once the Filter/FilterList is created (e.g. yourFilterList.addFilter(yourFilter);), you should add it to a Scan object (e.g. scan.setFilter(yourFilterList);). That's it: at this point you are ready to call the getScanner method on an HTable instance, which returns an iterable ResultScanner containing the rows matching your filter criteria. A short sketch follows.
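For illustration (not from the original post), a short sketch of the Filter + Scan flow against the 0.92/0.94-era client API; the table, column family, qualifier and values are made up.

// Hypothetical filtered scan.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "mytable");

// Both filters must pass for a row to be returned.
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
filters.addFilter(new PrefixFilter(Bytes.toBytes("user_")));
filters.addFilter(new SingleColumnValueFilter(
        Bytes.toBytes("cf"), Bytes.toBytes("status"),
        CompareOp.EQUAL, Bytes.toBytes("active")));

Scan scan = new Scan();
scan.setFilter(filters);

ResultScanner scanner = table.getScanner(scan);
for (Result row : scanner) {
    // iterate over matching rows ...
}
scanner.close();
table.close();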