Monday, April 30, 2012

What is : Hadoop Sequence File?

Hadoop Sequence File : A SequenceFile is a flat file consisting of binary key-value pairs. It can store any key and value that can be serialized as byte arrays.
3 Types : 
Uncompressed - records are stored exactly as they are appended.
Record-Compressed - only the value of each record is compressed.
Block-Compressed - keys and values are collected into blocks and compressed together.

Need for Sequence Files : Hadoop is meant for processing Big Data. HDFS uses a 64 MB default block size, so every file, however small, is tracked as at least one block. A tiny file does not physically fill 64 MB on disk, but each one costs the NameNode metadata and is processed by its own map task, so a large number of small files is very inefficient.
In practice, applications often deal with files of just a few KB. It is therefore advantageous to pack many such small files into one SequenceFile as key-value pairs (for example, file name as the key and file contents as the value), which lets programmers run the same logic on every record with MapReduce jobs.
The org.apache.hadoop.io.SequenceFile class provides Writer and Reader for writing and reading, and a Sorter for sorting SequenceFiles by key.
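A minimal sketch of writing and then reading back a block-compressed SequenceFile with the classic Hadoop 1.x API (the output path demo.seq and the IntWritable/Text key and value types are just assumptions for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("demo.seq"); // hypothetical output path

        // Write a few key-value pairs, block-compressed
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, IntWritable.class, Text.class,
                SequenceFile.CompressionType.BLOCK);
        try {
            for (int i = 0; i < 5; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        } finally {
            writer.close();
        }

        // Read every record back in order
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            IntWritable key = new IntWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}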
Thank YOU

How to : choose between DOM, SAX or XMLStreamWriter

What is XML : I call it a "language for the Internet"; it helps applications communicate seamlessly while staying in a human-readable format. Any XML file typically contains 'Elements' and 'Attributes', which are kinds of XML 'Node'.
(Element, CData, Comment, Attribute, Entity and Text are a few examples of node types.)
There are a number of APIs available to work with XML, and each has its positive and negative aspects.


W3C DOM : 
good for - random read and write of XML nodes (see the DOM sketch after this list).
not suitable for - large documents; the whole tree is held in memory, so it carries a large memory footprint and slower performance.
SAX :
good for - fast, lightweight, event-driven read access.
not suitable for - writing/creating XML nodes, or any kind of random access.
XMLStreamWriter :
good for - streaming out XML while building it; useful in web services handling larger files.
not suitable for - random read or write; it is a sequential, one-way, cursor-like implementation.
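
As a quick illustration of DOM's random read/write style, here is a small sketch; the input file books.xml and the <title> element are hypothetical names chosen for the example:

import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomDemo {
    public static void main(String[] args) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new File("books.xml")); // whole tree loaded into memory

        // Random read: grab the first <title> element anywhere in the tree
        Element title = (Element) doc.getElementsByTagName("title").item(0);
        System.out.println("First title: " + title.getTextContent());

        // Random write: change its text and add an attribute in place
        title.setTextContent("Updated title");
        title.setAttribute("edited", "true");

        // Serialize the modified tree back out
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.transform(new DOMSource(doc), new StreamResult(new File("books-out.xml")));
    }
}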


All you need is to choose the API closest to your need. XMLStreamWriter is a good all-purpose choice, and its low memory footprint makes it especially effective for mobile devices.
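
A minimal XMLStreamWriter sketch, streaming a small document straight to a file; the orders.xml file name and the element names are made up for illustration:

import java.io.FileOutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class StreamWriterDemo {
    public static void main(String[] args) throws Exception {
        XMLOutputFactory factory = XMLOutputFactory.newInstance();
        XMLStreamWriter writer = factory.createXMLStreamWriter(
                new FileOutputStream("orders.xml"), "UTF-8");

        writer.writeStartDocument("UTF-8", "1.0");
        writer.writeStartElement("orders");
        for (int i = 1; i <= 3; i++) {
            // Each element is pushed out as soon as it is written;
            // nothing is kept in memory as a tree.
            writer.writeStartElement("order");
            writer.writeAttribute("id", String.valueOf(i));
            writer.writeCharacters("item-" + i);
            writer.writeEndElement();
        }
        writer.writeEndElement(); // </orders>
        writer.writeEndDocument();
        writer.flush();
        writer.close();
    }
}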

Thank YOU