Friday, August 31, 2012

Extending Hive : Custom Mapper & Reducer

Hive: a data warehouse system built on top of Hadoop for ad-hoc querying.

  • Queries get converted into map and reduce tasks that run in parallel across a number of nodes, bringing back results quickly even from massive data sets. 
  • Hive also allows us to extend it and plug in our own implementation for the map and reduce stages. This can be done with scripts written in any programming language, which are then used as Hive functions.
  • Extending the mapper : By default Hive supports only a few input formats, where every line ends with '\n' and fields are separated by ',' or '|' etc. When data is not in a format Hive can understand, we can write our own parsing logic and emit key-value pairs by extending GenericUDTF (Hive's user-defined table-generating function class); see the sketch below. 
  • Extending the reducer : In the same way, we can derive further results in a custom reducer by extending GenericUDAF (user-defined aggregate function). Results can be filtered and passed on to multiple or different tables; a sketch of this follows below as well.
These are very useful when you need to analyse or query a data set stored in a very different or unstructured format. Extending the mapper helps bring such data into a common key-value form, and extending the reducer can then split the map output or produce different sets of results from it.
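
To make this concrete, here is a minimal mapper-side sketch: a GenericUDTF that splits a raw input line on a '#' delimiter and forwards it as a (key, value) row. The class name, delimiter and column names are only placeholders, not from any particular project.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class ParseRawLineUDTF extends GenericUDTF {

    @Override
    public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        // Declare the output schema: two string columns named "key" and "value".
        List<String> fieldNames = new ArrayList<String>();
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("key");
        fieldNames.add("value");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public void process(Object[] args) throws HiveException {
        // Parse one raw line and emit it as a (key, value) row.
        String line = args[0].toString();
        String[] parts = line.split("#");
        if (parts.length >= 2) {
            forward(new Object[] { parts[0], parts[1] });
        }
    }

    @Override
    public void close() throws HiveException {
        // Nothing to clean up.
    }
}

Packaged into a jar and registered in the Hive session, it can be called like a built-in function (the jar path, function name and table name below are again only illustrative):

ADD JAR /path/to/custom-udtf.jar;
CREATE TEMPORARY FUNCTION parse_raw AS 'ParseRawLineUDTF';
SELECT parse_raw(line) AS (key, value) FROM raw_logs;

On the reduce side, a GenericUDAF consists of a resolver and an evaluator: iterate() sees the raw rows on the map side, terminatePartial() hands a partial result over to the reducers, and merge()/terminate() combine those partials into the final value. A rough sketch (again with placeholder names) that counts non-empty values per group:

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.LongObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.io.LongWritable;

public class CountNonEmptyUDAF extends AbstractGenericUDAFResolver {

    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters) throws SemanticException {
        return new CountNonEmptyEvaluator();
    }

    public static class CountNonEmptyEvaluator extends GenericUDAFEvaluator {

        // Inspector used to read the partial counts handed to merge().
        private final LongObjectInspector partialOI =
                PrimitiveObjectInspectorFactory.writableLongObjectInspector;
        private final LongWritable result = new LongWritable(0);

        // Per-group buffer holding the running count.
        static class CountBuffer implements AggregationBuffer {
            long count;
        }

        @Override
        public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
            super.init(m, parameters);
            // Both the partial and the final result are a single bigint.
            return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
        }

        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            return new CountBuffer();
        }

        @Override
        public void reset(AggregationBuffer agg) throws HiveException {
            ((CountBuffer) agg).count = 0;
        }

        @Override
        public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
            // Map side: look at one row and update the running count.
            Object value = parameters[0];
            if (value != null && value.toString().length() > 0) {
                ((CountBuffer) agg).count++;
            }
        }

        @Override
        public Object terminatePartial(AggregationBuffer agg) throws HiveException {
            return terminate(agg);
        }

        @Override
        public void merge(AggregationBuffer agg, Object partial) throws HiveException {
            // Reduce side: add up the partial counts coming from the mappers.
            if (partial != null) {
                ((CountBuffer) agg).count += partialOI.get(partial);
            }
        }

        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            result.set(((CountBuffer) agg).count);
            return result;
        }
    }
}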

Thank You

What is Apache Sqoop?

Apache Sqoop is a tool for bulk importing/exporting data into the Hadoop ecosystem (HDFS, HBase or Hive).

  • works with a number of databases and commercial data warehouses.
  • available as a command-line tool; it can also be used from Java by passing the appropriate arguments.
  • graduated from the Incubator and became a top-level project in the ASF.
The current version of Sqoop runs a map-only job, with all the transformation happening in the map tasks. (Sqoop 2 will possibly have a reduce task as well.)
Fig: Sqoop 2 architecture diagram (taken from Cloudera.com)
Example:
alok@ubuntu:~/apache/sqoop-1.4.2$ bin/sqoop import --connect jdbc:mysql://<hostname>:3306/<dbname> --username <user> -P --driver com.mysql.jdbc.Driver --table <tablename> --hbase-table <hbase-tablename> --column-family <hbase-columnFamily> --hbase-create-table

or use it like this in your Java programs:
import java.util.ArrayList;

import org.apache.sqoop.Sqoop;

// Build the argument list exactly as it would appear on the command line.
ArrayList<String> args = new ArrayList<String>();
args.add("import");          // the tool name comes first when calling runTool()
args.add("--connect");
args.add("jdbc:mysql://<hostname>:3306/<dbname>");
args.add("--username");
args.add("<user>");
args.add("--driver");
args.add("com.mysql.jdbc.Driver");
args.add("--table");
args.add("<tablename>");
args.add("--hbase-table");
args.add("<hbase-tablename>");
args.add("--column-family");
args.add("<hbase-colFamilyName>");
args.add("--hbase-create-table");
args.add("--num-mappers");
args.add("2");

// Returns 0 on success, non-zero on failure.
int ret = Sqoop.runTool(args.toArray(new String[args.size()]));
  • Sqoop can write data directly to HDFS, HBase or Hive.
  • It can also export data back to RDBMS tables from Hadoop; a sample export command is shown after this list.
  • Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks.
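For the export direction, the command looks much like the import above. This is only a rough sketch: the <...> placeholders need to be filled in, and the HDFS directory is just an example of where previously imported or generated data might live.

alok@ubuntu:~/apache/sqoop-1.4.2$ bin/sqoop export --connect jdbc:mysql://<hostname>:3306/<dbname> --username <user> -P --table <tablename> --export-dir <hdfs-dir-with-data> --num-mappers 2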