SharePoint Orange: Extending Hive : Custom Mapper & Reducer

Friday, August 31, 2012

Hive : A data warehouse system built on top of Hadoop for ad hoc querying.

A query gets converted into map & reduce tasks that run in parallel on a number of nodes, so results can be brought back quickly even from very large data sets.
Hive also allows us to extend it and write our own implementation of the map & reduce steps. This can be done with scripts written in any programming language, which are then used from Hive queries much like functions.

Extending Mapper : By default Hive supports a few input formats in which every line is terminated by '\n' and fields are separated by a delimiter such as ',' or '|'. When the data is not in a format that Hive can understand, we can write the parsing logic in a custom mapper and then emit key-value pairs. For this purpose Hive lets us extend the GenericUDTF (User Defined Table-generating Function) class.
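To make the mapper idea concrete, here is a minimal sketch of the script route (a streaming script rather than a Java GenericUDTF class). It assumes a made-up input format, `key=value` fields separated by ';', and the names `parse_line`, `map_lines`, `run_mapper`, `kv_mapper.py` and `raw_lines` are hypothetical. Hive passes rows to such a script on standard input and reads tab-separated columns back from standard output.

```python
import sys

# Hypothetical usage from Hive (table and file names are assumptions):
#   ADD FILE kv_mapper.py;
#   SELECT TRANSFORM(line) USING 'python kv_mapper.py' AS (key, value)
#   FROM raw_lines;


def parse_line(line):
    """Parse 'user=alice;action=login' into [('user', 'alice'), ('action', 'login')]."""
    pairs = []
    for field in line.strip().split(";"):
        if "=" not in field:
            continue  # skip fragments that are not key=value
        key, value = field.split("=", 1)
        pairs.append((key.strip(), value.strip()))
    return pairs


def map_lines(lines):
    """Yield one tab-separated 'key<TAB>value' row per parsed pair."""
    for line in lines:
        for key, value in parse_line(line):
            yield f"{key}\t{value}"


def run_mapper(stdin, stdout):
    """Read raw lines from stdin and write one output row per line to stdout."""
    for row in map_lines(stdin):
        stdout.write(row + "\n")


# To use this file as the Hive script, end it with:
#     run_mapper(sys.stdin, sys.stdout)
```

Lines that contain no key=value field simply produce no output rows, so malformed input is dropped instead of failing the whole task.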

Extending Reducer : In the same way, we can derive further results in a custom reducer by writing a GenericUDAF (User Defined Aggregate Function). The results can be filtered and passed to multiple/different tables.
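A custom reducer can be sketched the same way as a streaming script (again the script route, not a Java GenericUDAF class). This sketch assumes the input rows are tab-separated `key<TAB>value` pairs with integer values, and that Hive has already grouped rows by key, for example with CLUSTER BY. The names `reduce_rows`, `run_reducer`, `sum_reducer.py` and `kv_pairs` are hypothetical.

```python
import sys
from itertools import groupby

# Hypothetical usage from Hive (table and file names are assumptions):
#   ADD FILE sum_reducer.py;
#   FROM (SELECT key, value FROM kv_pairs CLUSTER BY key) t
#   SELECT TRANSFORM(t.key, t.value) USING 'python sum_reducer.py' AS (key, total);


def reduce_rows(lines):
    """Sum the integer values of consecutive rows that share the same key."""
    rows = (line.rstrip("\n").split("\t") for line in lines)
    rows = (row for row in rows if len(row) == 2)  # ignore malformed rows
    for key, group in groupby(rows, key=lambda row: row[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


def run_reducer(stdin, stdout):
    """Read grouped rows from stdin and write one 'key<TAB>total' row per key."""
    for row in reduce_rows(stdin):
        stdout.write(row + "\n")


# To use this file as the Hive script, end it with:
#     run_reducer(sys.stdin, sys.stdout)
```

The script only merges rows that arrive next to each other, which is why the query has to cluster the rows by key before handing them to the reducer.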

These can be very useful when you need to analyse or query a set of data stored in very different or unstructured formats. A custom mapper can bring that data into a common key-value form, and a custom reducer can then split the map task results or produce a different set of data from them.

Thank You

@alokawi