Friday, August 31, 2012

Extending Hive : Custom Mapper & Reducer

Hive: a data warehouse system built on top of Hadoop for ad-hoc querying.

  • Queries get converted into map and reduce tasks that run in parallel across a number of nodes, bringing back results quickly even from massive data sets. 
  • Hive also allows us to extend it and plug in our own implementation for the map and reduce stages. This can be done with scripts written in any programming language, which are then used as Hive functions.
  • Extending the mapper : By default Hive supports only a few input formats, where every line ends with '\n' and fields are separated by ',' or '|' etc. When data is not in a format Hive can understand, we can write our own parsing logic and emit key-value pairs by extending GenericUDTF (Hive's user-defined table-generating function class); see the sketch below. 
  • Extending the reducer : In the same way, we can derive further results in a custom reducer by extending GenericUDAF (user-defined aggregate function). Results can be filtered and passed on to multiple or different tables; a sketch of this follows below as well.
These are very useful when you need to analyse or query a data set stored in a very different or unstructured format. Extending the mapper helps bring such data into a common key-value form, and extending the reducer can then split the map output or produce different sets of results from it.
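
To make this concrete, here is a minimal mapper-side sketch: a GenericUDTF that splits a raw input line on a '#' delimiter and forwards it as a (key, value) row. The class name, delimiter and column names are only placeholders, not from any particular project.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class ParseRawLineUDTF extends GenericUDTF {

    @Override
    public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        // Declare the output schema: two string columns named "key" and "value".
        List<String> fieldNames = new ArrayList<String>();
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("key");
        fieldNames.add("value");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public void process(Object[] args) throws HiveException {
        // Parse one raw line and emit it as a (key, value) row.
        String line = args[0].toString();
        String[] parts = line.split("#");
        if (parts.length >= 2) {
            forward(new Object[] { parts[0], parts[1] });
        }
    }

    @Override
    public void close() throws HiveException {
        // Nothing to clean up.
    }
}

Packaged into a jar and registered in the Hive session, it can be called like a built-in function (the jar path, function name and table name below are again only illustrative):

ADD JAR /path/to/custom-udtf.jar;
CREATE TEMPORARY FUNCTION parse_raw AS 'ParseRawLineUDTF';
SELECT parse_raw(line) AS (key, value) FROM raw_logs;

On the reduce side, a GenericUDAF consists of a resolver and an evaluator: iterate() sees the raw rows on the map side, terminatePartial() hands a partial result over to the reducers, and merge()/terminate() combine those partials into the final value. A rough sketch (again with placeholder names) that counts non-empty values per group:

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.LongObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.io.LongWritable;

public class CountNonEmptyUDAF extends AbstractGenericUDAFResolver {

    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters) throws SemanticException {
        return new CountNonEmptyEvaluator();
    }

    public static class CountNonEmptyEvaluator extends GenericUDAFEvaluator {

        // Inspector used to read the partial counts handed to merge().
        private final LongObjectInspector partialOI =
                PrimitiveObjectInspectorFactory.writableLongObjectInspector;
        private final LongWritable result = new LongWritable(0);

        // Per-group buffer holding the running count.
        static class CountBuffer implements AggregationBuffer {
            long count;
        }

        @Override
        public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
            super.init(m, parameters);
            // Both the partial and the final result are a single bigint.
            return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
        }

        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            return new CountBuffer();
        }

        @Override
        public void reset(AggregationBuffer agg) throws HiveException {
            ((CountBuffer) agg).count = 0;
        }

        @Override
        public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
            // Map side: look at one row and update the running count.
            Object value = parameters[0];
            if (value != null && value.toString().length() > 0) {
                ((CountBuffer) agg).count++;
            }
        }

        @Override
        public Object terminatePartial(AggregationBuffer agg) throws HiveException {
            return terminate(agg);
        }

        @Override
        public void merge(AggregationBuffer agg, Object partial) throws HiveException {
            // Reduce side: add up the partial counts coming from the mappers.
            if (partial != null) {
                ((CountBuffer) agg).count += partialOI.get(partial);
            }
        }

        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            result.set(((CountBuffer) agg).count);
            return result;
        }
    }
}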

Thank You

What is Apache Sqoop?

Apache Sqoop is a tool for bulk importing/exporting data into the Hadoop ecosystem (HDFS, HBase or Hive).

  • works with a number of databases and commercial data warehouses.
  • available as a command-line tool; it can also be used from Java by passing the appropriate arguments.
  • graduated from the Incubator and became a top-level project in the ASF.
The current version of Sqoop runs a map-only job, with all the transformation happening in the map tasks. (Sqoop 2 will possibly have a reduce task as well.)
Fig: Sqoop 2 architecture diagram (taken from Cloudera.com)
Example:
alok@ubuntu:~/apache/sqoop-1.4.2$ bin/sqoop import --connect jdbc:mysql://<hostname>:3306/<dbname> --username <user> -P --driver com.mysql.jdbc.Driver --table <tablename> --hbase-table <hbase-tablename> --column-family <hbase-columnFamily> --hbase-create-table

or use it like this in your Java programs:
import java.util.ArrayList;

import org.apache.sqoop.Sqoop;

// Build the argument list exactly as it would appear on the command line.
ArrayList<String> args = new ArrayList<String>();
args.add("import");          // the tool name comes first when calling runTool()
args.add("--connect");
args.add("jdbc:mysql://<hostname>:3306/<dbname>");
args.add("--username");
args.add("<user>");
args.add("--driver");
args.add("com.mysql.jdbc.Driver");
args.add("--table");
args.add("<tablename>");
args.add("--hbase-table");
args.add("<hbase-tablename>");
args.add("--column-family");
args.add("<hbase-colFamilyName>");
args.add("--hbase-create-table");
args.add("--num-mappers");
args.add("2");

// Returns 0 on success, non-zero on failure.
int ret = Sqoop.runTool(args.toArray(new String[args.size()]));
  • Sqoop can write data directly to HDFS, HBase or Hive.
  • It can also export data back to RDBMS tables from Hadoop; a sample export command is shown after this list.
  • Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks.
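For the export direction, the command looks much like the import above. This is only a rough sketch: the <...> placeholders need to be filled in, and the HDFS directory is just an example of where previously imported or generated data might live.

alok@ubuntu:~/apache/sqoop-1.4.2$ bin/sqoop export --connect jdbc:mysql://<hostname>:3306/<dbname> --username <user> -P --table <tablename> --export-dir <hdfs-dir-with-data> --num-mappers 2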