Hive: A data warehouse system built on top of Hadoop for ad hoc queries.
- Queries are converted into map and reduce tasks that run in parallel across a number of nodes, which lets Hive return results from very large data sets in reasonable time.
- Hive also allows us to plug in our own implementation of the map and reduce steps. This can be done with scripts written in any programming language, which are then called from Hive queries (for example through the TRANSFORM clause).
- Extending Mapper: By default Hive supports a few input formats, such as text files where every row ends with '\n' and fields are separated by a delimiter such as ',' or '|'. When the data is not in a format that Hive can understand, we can write the parsing logic in a custom mapper-side function and emit the parsed fields as rows (for example key-value pairs). For this purpose Hive lets us extend the GenericUDTF (User-Defined Table-Generating Function) class.
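- Extending Reducer: In the same way, we can derive further results in a custom reducer by writing a generic UDAF (User-Defined Aggregate Function), which in practice means extending AbstractGenericUDAFResolver and GenericUDAFEvaluator. The results can then be filtered and written to one or more tables.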
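
The script-based route mentioned above can be sketched as a small Python streaming script. This is only a minimal illustration, not Hive's own API: the raw record format ('user=alice;bytes=120'), the function names and the file name hive_stream.py are all assumptions made for the example. What it does rely on is that Hive streams rows to such a script as tab-separated lines on stdin and reads tab-separated lines back from stdout.

```python
import sys
from itertools import groupby


def parse_record(line):
    """Parse one raw line of a hypothetical format: 'user=alice;bytes=120'."""
    parts = line.strip().split(";")
    fields = dict(p.split("=", 1) for p in parts if "=" in p)
    return fields["user"], int(fields["bytes"])


def map_lines(lines):
    """Mapper step: turn raw lines into tab-separated key/value rows."""
    for line in lines:
        try:
            key, value = parse_record(line)
        except (KeyError, ValueError):
            continue  # skip blank or malformed lines instead of failing the job
        yield f"{key}\t{value}"


def reduce_lines(lines):
    """Reducer step: sum the values of each key.

    Assumes rows with the same key arrive next to each other,
    e.g. because the query used CLUSTER BY on the key column.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for key, group in groupby(rows, key=lambda row: row[0]):
        yield f"{key}\t{sum(int(row[1]) for row in group)}"


STEPS = {"map": map_lines, "reduce": reduce_lines}

# Run as: python3 hive_stream.py map   (or: reduce), reading rows from stdin.
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in STEPS:
    for row in STEPS[sys.argv[1]](sys.stdin):
        print(row)
```

Under these assumptions, the script would be registered with ADD FILE hive_stream.py; and used in a query such as SELECT TRANSFORM(line) USING 'python3 hive_stream.py map' AS (user, bytes) FROM raw_logs; where raw_logs and its line column are hypothetical names. The reduce mode would be applied in the same way to the mapper output after a CLUSTER BY user. A Java GenericUDTF or UDAF follows the same parse-then-aggregate idea but runs inside Hive instead of in an external process.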