Friday, August 31, 2012

Extending Hive : Custom Mapper & Reducer

Hive : A data warehouse system built on top of Hadoop for ad hoc queries.

  • Queries are converted into map and reduce tasks that run in parallel across a number of nodes, so results come back quickly even from massive data sets.
  • Hive also allows us to extend it with our own implementation of the map and reduce steps. These can be scripts written in any programming language, which are then invoked from a Hive query (through the TRANSFORM, MAP and REDUCE clauses).
  • Extending Mapper : By default Hive supports a few input formats in which every record ends with '\n' and fields are separated by a delimiter such as ',' or '|'. When the data is not in a format Hive can understand, we can write the parsing logic in a custom mapper and emit key-value pairs. Hive provides the GenericUDTF (User-Defined Table-generating Function) class, which we can extend for this purpose.
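  • Extending Reducer : In the same way, we can derive further results in a custom reducer by writing a generic UDAF (User-Defined Aggregate Function), which in Hive means extending AbstractGenericUDAFResolver and GenericUDAFEvaluator. The results can be filtered and written to multiple, different tables.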
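
To make the script-based mapper concrete, here is a minimal sketch in Python. It assumes a made-up input format in which each raw line looks like `k1=v1;k2=v2`; the format, the function names and the file name `mapper.py` are all hypothetical and not from any real data set. The only Hive behaviour it relies on is that a streaming script receives rows on stdin and returns tab-separated columns, one row per line, on stdout.

```python
"""mapper.py: hypothetical streaming mapper for a Hive MAP/TRANSFORM clause."""
import sys


def parse_record(line):
    """Split one raw 'k1=v1;k2=v2' line (an assumed format) into (key, value) pairs."""
    pairs = []
    for field in line.strip().split(";"):
        if "=" not in field:
            continue  # skip fields that do not look like key=value
        key, value = field.split("=", 1)
        pairs.append((key.strip(), value.strip()))
    return pairs


def map_lines(lines):
    """Yield one 'key<TAB>value' output row per pair found in the input lines."""
    for line in lines:
        for key, value in parse_record(line):
            yield f"{key}\t{value}"


if __name__ == "__main__":
    # Hive streams the selected columns to stdin and reads rows back from stdout.
    for row in map_lines(sys.stdin):
        sys.stdout.write(row + "\n")
```

With a hypothetical table `raw_logs` holding one string column `line`, such a script could be registered with `ADD FILE mapper.py;` and called as `SELECT TRANSFORM(line) USING 'python mapper.py' AS (k, v) FROM raw_logs;`. Whether `python` is available on the worker nodes depends on your cluster.
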
Custom mappers and reducers are very useful when you need to analyse or query data that is stored in very different formats or is unstructured. A custom mapper helps bring such data into a common key-value form, and a custom reducer can then split the map output or produce different data sets from it.
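
The reducer side can be sketched the same way. This example assumes the reducer receives `key<TAB>value` rows already grouped by key (in Hive that is what `CLUSTER BY` on the key column before the REDUCE clause arranges) and simply counts the rows per key. The file name `reducer.py` and the function name are hypothetical, and counting is only one example of a derived result.

```python
"""reducer.py: hypothetical streaming reducer for a Hive REDUCE clause."""
import sys
from itertools import groupby


def reduce_lines(lines):
    """Yield 'key<TAB>count' for each run of consecutive rows sharing a key.

    Assumes rows with the same key arrive next to each other.
    """
    rows = (line.rstrip("\n").split("\t", 1) for line in lines)
    rows = (row for row in rows if len(row) == 2)  # drop malformed rows
    for key, group in groupby(rows, key=lambda row: row[0]):
        yield f"{key}\t{sum(1 for _ in group)}"


if __name__ == "__main__":
    # Hive streams the clustered map output to stdin and reads results from stdout.
    for row in reduce_lines(sys.stdin):
        sys.stdout.write(row + "\n")
```

For example, the input rows `a<TAB>1`, `a<TAB>2`, `b<TAB>3` produce `a<TAB>2` and `b<TAB>1`. If the rows for a key are not contiguous, the same key is emitted more than once, which is why the clustering step matters.
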

Thank You
