SharePoint Orange
Data Science Hadoop HDFS HBase Hive Pig Chukwa MapReduce EC2 SharePoint Spark Storm Kafka Docker
Alok K (http://www.blogger.com/profile/14863045326848859449)

[Repost] Top 5 Mistakes to Avoid When Writing Apache Spark Applications (posted 2018-10-21)
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<iframe src="//www.slideshare.net/slideshow/embed_code/key/jwiwqfaY4CKI1N" width="595" height="485" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen> </iframe> <div style="margin-bottom:5px"> <strong> <a href="//www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications" title="Top 5 Mistakes to Avoid When Writing Apache Spark Applications" target="_blank">Top 5 Mistakes to Avoid When Writing Apache Spark Applications</a> </strong> from <strong><a href="https://www.slideshare.net/cloudera" target="_blank">Cloudera, Inc.</a></strong> </div>
Spark Summit East 2016 presentation by Mark Grover and Ted Malaska
</div>
Intro to Graph Databases Neo4j (posted 2017-06-06)
<div dir="ltr" style="text-align: left;" trbidi="on">
<iframe allowfullscreen="" frameborder="0" height="340" src="https://www.youtube.com/embed/NH6WoJHN4UA" width="520"></iframe></div>
Enterprise Integration Patterns : What, Why, How (posted 2017-03-30)
<div dir="ltr" style="text-align: left;" trbidi="on">
<iframe allowfullscreen="" frameborder="0" height="344" src="https://www.youtube.com/embed/C8vWt-emxK0" width="459"></iframe></div>
How to : working with Graph database, Tinkerpop and Gremlin (posted 2017-02-27)
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<iframe allowfullscreen="" frameborder="0" height="485" marginheight="0" marginwidth="0" scrolling="no" src="//www.slideshare.net/slideshow/embed_code/key/pAyuNHAqehnzTn" style="border-width: 1px; border: 1px solid #ccc; margin-bottom: 5px; max-width: 100%;" width="595"> </iframe> <br />
<div style="margin-bottom: 5px;">
<strong> <a href="https://www.slideshare.net/calebwjones/intro-to-graph-databases-using-tinkerpops-titandb-and-gremlin" target="_blank" title="Intro to Graph Databases Using Tinkerpop, TitanDB, and Gremlin">Intro to Graph Databases Using Tinkerpop, TitanDB, and Gremlin</a> </strong> from <strong><a href="https://www.slideshare.net/calebwjones" target="_blank">Caleb Jones</a></strong>
<a href="https://www.slideshare.net/calebwjones/intro-to-graph-databases-using-tinkerpops-titandb-and-gremlin">https://www.slideshare.net/calebwjones/intro-to-graph-databases-using-tinkerpops-titandb-and-gremlin</a>
</div>
</div>
SparkR with Zeppelin (posted 2016-03-18)
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<iframe allowfullscreen="" frameborder="0" height="485" marginheight="0" marginwidth="0" scrolling="no" src="//www.slideshare.net/slideshow/embed_code/key/ftKASHhAjzFi55" style="border-width: 1px; border: 1px solid #ccc; margin-bottom: 5px; max-width: 100%;" width="595"> </iframe> <br />
<div style="margin-bottom: 5px;">
<strong> <a href="https://www.slideshare.net/felixcss/sparkr-zeppelin-52650031" target="_blank" title="SparkR + Zeppelin">SparkR + Zeppelin</a> </strong> from <strong><a href="https://www.slideshare.net/felixcss" target="_blank">felixcss</a></strong> </div>
</div>
Algorithm : Types, Classification and Definition (posted 2016-01-02)
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<ol style="text-align: left;">
<li><span style="font-size: large;">Simple recursive algorithms</span></li>
<li><span style="font-size: large;">Backtracking algorithms</span></li>
<li><span style="font-size: large;">Divide and conquer algorithms</span></li>
<li><span style="font-size: large;">Dynamic programming algorithms</span></li>
<li><span style="font-size: large;">Greedy algorithms</span></li>
<li><span style="font-size: large;">Branch and bound algorithms</span></li>
<li><span style="font-size: large;">Brute force algorithms</span></li>
<li><span style="font-size: large;">Randomized algorithms</span></li>
</ol>
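<div>
To make two of these categories concrete, here is a minimal sketch that computes Fibonacci numbers first as a simple recursive algorithm and then as a dynamic programming algorithm. The class and method names are my own, chosen only for illustration.</div>

```java
class FibonacciDemo {
    // Simple recursive algorithm: the same subproblems are recomputed many times.
    static long fibRecursive(int n) {
        if (n < 2) {
            return n;
        }
        return fibRecursive(n - 1) + fibRecursive(n - 2);
    }

    // Dynamic programming (bottom-up): each subproblem is solved exactly once.
    static long fibDynamic(int n) {
        if (n < 2) {
            return n;
        }
        long prev = 0;
        long curr = 1;
        for (int i = 2; i <= n; i++) {
            long next = prev + curr;
            prev = curr;
            curr = next;
        }
        return curr;
    }

    public static void main(String[] args) {
        System.out.println(fibRecursive(20)); // prints 6765
        System.out.println(fibDynamic(50));   // prints 12586269025
    }
}
```

<div>
Both methods return the same values; the recursive one just takes exponentially more calls as n grows, which is why it is only called with a small n here.</div>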
<div>
<span style="font-size: large;"><br /></span></div>
<div>
<span style="font-size: large;"><b>Complexity of Algorithms : </b></span></div>
<div>
<ol style="text-align: left;">
<li><span style="font-size: large;">Constant - O(1)</span></li>
<li><span style="font-size: large;">Logarithmic - O(log(N))</span></li>
<li><span style="font-size: large;">Linear - O(N)</span></li>
<li><span style="font-size: large;">Quadratic - O(N*N)</span></li>
<li><span style="font-size: large;">Cubic </span><span style="font-size: large;"> - O(N*N*N)</span></li>
<li><span style="font-size: large;">...</span></li>
<li><span style="font-size: large;">Exponential </span><span style="font-size: large;"> - O(2^N), and beyond that Factorial - O(N!). Note that O(N^K) for a fixed K is polynomial, not exponential.</span></li>
</ol>
</div>
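<div>
A rough way to feel the difference between Linear and Logarithmic growth is to count comparisons. The sketch below (names are illustrative, not from any library) counts the steps taken by a linear search and a binary search over the same sorted array.</div>

```java
class SearchSteps {
    // Linear search: O(N) comparisons in the worst case.
    static int linearSteps(int[] sorted, int target) {
        int steps = 0;
        for (int value : sorted) {
            steps++;
            if (value == target) {
                break;
            }
        }
        return steps;
    }

    // Binary search: O(log(N)) comparisons, halving the search range each step.
    static int binarySteps(int[] sorted, int target) {
        int steps = 0;
        int lo = 0;
        int hi = sorted.length - 1;
        while (lo <= hi) {
            steps++;
            int mid = lo + (hi - lo) / 2;
            if (sorted[mid] == target) {
                break;
            }
            if (sorted[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return steps;
    }

    public static void main(String[] args) {
        int[] data = new int[1024];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        // Searching for the last element is the worst case for linear search.
        System.out.println("linear: " + linearSteps(data, 1023));
        System.out.println("binary: " + binarySteps(data, 1023));
    }
}
```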
</div>
Road to Machine Learning (posted 2015-12-29)
<div dir="ltr" style="text-align: left;" trbidi="on">
<ol style="text-align: left;">
<li>You start with a dataset to analyse. - Purchase / Social / Medical / Travel</li>
<li>Many variables are typically collected. - Categorical / Continuous / Geo</li>
<li>The majority of them can be irrelevant and only add noise.</li>
<li>Data Mining is Statistics at Scale and Speed.</li>
<li>Applications in Intelligence / Genetics / Natural Sc. / Business.</li>
<li>Data Mining has its origins in Categorical data, whereas Statistics has traditionally dealt with Continuous data.</li>
<li>A large model can overfit the training dataset, which may lead to higher prediction error in new situations.</li>
<li>Consider whether each predictor variable will be available, and whether its relationship with the response will still hold, in future data.</li>
<li>Cluster analysis is an example of <i><b>Unsupervised learning</b></i></li>
<li>Dimension Reduction</li>
<li>Association Rules</li>
<li>Classification is an example of <b style="font-style: italic;">Supervised learning</b></li>
<li>Regression, Regression Trees, Nearest Neighbour - Continuous response.</li>
<li>Logistic Regression, Classification Trees, Nearest Neighbour, Discriminant analysis and Naive Bayes methods are well suited for Categorical response. </li>
<li><b>Data Mining </b>should be viewed as a process :</li>
<ol>
<li>Data Storage & PreProcessing</li>
<li>Identify variables for investigation</li>
<li>Screen the outliers and missing values from data</li>
<li>Data needs to be partitioned into <b><i>training</i></b>, <b><i>test</i></b> and <b><i>evaluation</i></b> sets.</li>
<li>Use <b><i>Sampling</i></b> for Large datasets.</li>
<li>Visualize your data - Line, Bar, Scatter, Box, Histogram, Map, Geo</li>
<li>Summary of data - Mean, Median, Mode, Standard Deviation, Correlation, Principal Components</li>
<li>Apply appropriate model - Linear, Logistic, Trees, K-means ...</li>
<li>Verify the findings against the <b><i>evaluation </i></b>set.</li>
<li>Get the insights, Apply the findings! Plan - do - check - act !!</li>
</ol>
</ol>
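<div>
The partitioning step in the process above can be sketched as follows. This is only one simple way to do it: shuffle the rows with a fixed seed and cut the shuffled list into training, test and evaluation sets. The class name, the fractions and the seed are assumptions made for the example.</div>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class DataPartition {
    // Shuffles a copy of the rows and splits it into three lists:
    // index 0 = training, index 1 = test, index 2 = evaluation (the remainder).
    static <T> List<List<T>> split(List<T> rows, double trainFrac, double testFrac, long seed) {
        if (trainFrac < 0 || testFrac < 0 || trainFrac + testFrac > 1.0) {
            throw new IllegalArgumentException("fractions must be non-negative and sum to at most 1");
        }
        List<T> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed keeps the split repeatable
        int n = shuffled.size();
        int trainEnd = (int) Math.round(n * trainFrac);
        int testEnd = Math.min(n, trainEnd + (int) Math.round(n * testFrac));

        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, trainEnd)));
        parts.add(new ArrayList<>(shuffled.subList(trainEnd, testEnd)));
        parts.add(new ArrayList<>(shuffled.subList(testEnd, n)));
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            rows.add(i);
        }
        List<List<Integer>> parts = split(rows, 0.6, 0.2, 42L);
        System.out.println("training=" + parts.get(0).size()
                + " test=" + parts.get(1).size()
                + " evaluation=" + parts.get(2).size());
    }
}
```

<div>
For very large datasets the same idea applies to a sample of the rows rather than the full list.</div>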
More on the next blog! Thank you!!<br /><a href="https://www.linkedin.com/in/alokawi">https://www.linkedin.com/in/alokawi</a> ( <i>Data Engineer, Analytics Engineer, Data Science </i>)<br /><div>
<br /></div>
</div>
Correlation, Regression and Causation ? (posted 2015-10-16)
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<br />
<span style="color: purple; font-size: x-large;"><b>What is a Correlation in statistics?</b></span><br />
<br />
<span style="font-size: large;">Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. For example, height and weight are related; taller people tend to be heavier than shorter people. The relationship isn't perfect. </span><br />
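<div>
One common way to put a number on "how strongly" is the Pearson correlation coefficient, which ranges from -1 to +1. The sketch below is a plain implementation of that formula; the height and weight values in it are made up for illustration and are not real measurements.</div>

```java
class Correlation {
    // Pearson correlation coefficient of two equally sized samples.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two samples of equal length >= 2");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Hypothetical heights (cm) and weights (kg), invented for this example.
        double[] height = {150, 160, 165, 172, 180, 188};
        double[] weight = {52, 58, 66, 64, 79, 85};
        System.out.println("r = " + pearson(height, weight));
    }
}
```

<div>
A value close to +1 means taller people in the sample tend to be heavier; it still says nothing about cause.</div>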
<br />
<br />
<span style="color: purple; font-size: x-large;"><b>What is a Regression in statistics?</b></span><br />
<br />
<span style="font-size: large;">In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression.</span><br />
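<div>
For the simple (one explanatory variable) case, the least-squares line y = intercept + slope * x has a closed-form solution. A minimal sketch, with class and method names of my own choosing:</div>

```java
class SimpleLinearRegression {
    // Ordinary least squares for one explanatory variable.
    // Returns {intercept, slope}.
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two samples of equal length >= 2");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0.0; // sum of (x - meanX)^2
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("x values must not all be equal");
        }
        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;
        return new double[] {intercept, slope};
    }

    public static void main(String[] args) {
        // Points lying exactly on y = 1 + 2x, so the fit should recover that line.
        double[] x = {0, 1, 2, 3, 4};
        double[] y = {1, 3, 5, 7, 9};
        double[] line = fit(x, y);
        System.out.println("intercept=" + line[0] + " slope=" + line[1]);
    }
}
```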
<div>
<span style="font-size: large;"><br /></span>
<br />
<div class="nmcw nmbl nmca" id="nmtbi3cntw" style="height: 203px; overflow: hidden; transition: height 300ms ease-out;">
<div >
<b style="color: purple; font-size: x-large;">What is a Causation in statistics?</b></div>
<span style="background-color: transparent; font-size: large;">When an article says that causation was found, this means that the researchers found that changes in one variable they measured directly caused changes in the other.</span></div>
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://i.stack.imgur.com/ff8t0.gif" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://i.stack.imgur.com/ff8t0.gif" height="196" width="640" /></a></div>
<div class="nmcw nmbl nmca" id="nmtbi3cntw" style="background-color: white; color: #222222; font-family: arial, sans-serif; height: 203px; overflow: hidden; transition: height 300ms ease-out;">
<b style="color: purple; font-family: 'Times New Roman'; font-size: xx-large;"><br /></b></div>
</div>
What Does a Data Scientist Do? - http://datasciencelondon.org/ (posted 2015-01-12)
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<iframe allowfullscreen="" frameborder="0" height="485" marginheight="0" marginwidth="0" scrolling="no" src="//www.slideshare.net/slideshow/embed_code/16190193" style="border-width: 1px; border: 1px solid #CCC; margin-bottom: 5px; max-width: 100%;" width="595"> </iframe> <br />
<div style="margin-bottom: 5px;">
<strong> <a href="https://www.slideshare.net/datasciencelondon/big-data-sorry-data-science-what-does-a-data-scientist-do" target="_blank" title="Big Data [sorry] & Data Science: What Does a Data Scientist Do?">Big Data [sorry] & Data Science: What Does a Data Scientist Do?</a> </strong> from <strong><a href="https://www.slideshare.net/datasciencelondon" target="_blank">Data Science London</a></strong> </div>
</div>
Alibaba Group CEO, Jack Ma, (马云),Jack Ma' Speech In Stanford University,... (posted 2014-09-29)
<div dir="ltr" style="text-align: left;" trbidi="on">
<iframe allowfullscreen="" frameborder="0" height="444" src="//www.youtube.com/embed/MRp4jiJed3c" width="619"></iframe></div>
Decipher everything!! (posted 2014-09-10)
<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7BoQFS0kXKTW5iaIDucAc21FxW4OyBllzaOBcR5iqjewjiYDlIiZ62zEfi7k-a5UDtOV0bFr6pVDz_dobEP6_1ky1fgrxz1YcQuJ4zHjlttFk1YZZW4BOM5xnAua9lLNwgvJ2Z0qqZW8c/s1600/IMG_9689317194047.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7BoQFS0kXKTW5iaIDucAc21FxW4OyBllzaOBcR5iqjewjiYDlIiZ62zEfi7k-a5UDtOV0bFr6pVDz_dobEP6_1ky1fgrxz1YcQuJ4zHjlttFk1YZZW4BOM5xnAua9lLNwgvJ2Z0qqZW8c/s1600/IMG_9689317194047.jpeg" height="243" width="320" /></a></div>
<br /></div>
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data (posted 2014-08-04)
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<iframe allowfullscreen="" frameborder="0" height="521" marginheight="0" marginwidth="0" scrolling="no" src="//www.slideshare.net/slideshow/embed_code/31944971?rel=0" style="border-width: 1px; border: 1px solid #CCC; margin-bottom: 5px; max-width: 100%;" width="612"> </iframe> <br />
<div style="margin-bottom: 5px;">
<strong> <a href="https://www.slideshare.net/hortonworks/hortonworks-elastic-searchfinal" target="_blank" title="Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data">Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data</a> </strong> from <strong><a href="http://www.slideshare.net/hortonworks" target="_blank">Hortonworks</a></strong> </div>
</div>
Tech evangelist Carlos Dominguez at TEDxNJIT (posted 2014-07-25)
<div dir="ltr" style="text-align: left;" trbidi="on">
<iframe allowfullscreen="" frameborder="0" height="420" src="//www.youtube.com/embed/ZGkrr6wbxb8" width="610"></iframe></div>
A Quick Tutorial on Mahout’s Recommendation Engine (posted 2013-08-24)
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<iframe allowfullscreen="" frameborder="0" height="466" marginheight="0" marginwidth="0" mozallowfullscreen="" scrolling="no" src="http://www.slideshare.net/slideshow/embed_code/11795960" style="border-width: 1px 1px 0; border: 1px solid #CCC; margin-bottom: 5px;" webkitallowfullscreen="" width="580"> </iframe> <br />
<br />
<b><span style="color: #444444; font-size: large;"><i>A nicely explained Recommendation Engine algorithm that uses a series of MapReduce jobs to produce its predictions.</i></span></b></div>
Machine Learning and Probabilistic Programming (posted 2013-07-07)
<div dir="ltr" style="text-align: left;" trbidi="on">
<iframe allowfullscreen="" frameborder="0" height="400" src="//www.youtube.com/embed/S6IbD86Dbvc" width="620"></iframe></div>
The world needs all kinds of minds (posted 2013-06-25)
<div dir="ltr" style="text-align: left;" trbidi="on">
<br /><iframe width="600" height="400" src="http://www.youtube.com/embed/fn_9f5x0f1Q" frameborder="0" allowfullscreen></iframe></div>
Lucene Introduction (posted 2013-06-09)
<div dir="ltr" style="text-align: left;" trbidi="on">
<br /><iframe src="http://www.slideshare.net/slideshow/embed_code/2529729" width="512" height="421" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px" allowfullscreen webkitallowfullscreen mozallowfullscreen> </iframe> </div>
Grid Computing (posted 2013-05-17)
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<iframe allowfullscreen="" frameborder="0" height="488" marginheight="0" marginwidth="0" mozallowfullscreen="" scrolling="no" src="http://www.slideshare.net/slideshow/embed_code/11857367" style="border-width: 1px 1px 0; border: 1px solid #CCC; margin-bottom: 5px;" webkitallowfullscreen="" width="582"> </iframe> </div>
Ubuntu for tablets (posted 2013-04-19)
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<iframe width="680" height="425" src="http://www.youtube.com/embed/5_4fXQcxFRs" frameborder="0" allowfullscreen></iframe>
</div>
Information is the Oil of 21st century (posted 2013-03-21)
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<iframe width="640" height="480" src="http://www.youtube.com/embed/-2240_fDJQE" frameborder="0" allowfullscreen></iframe></div>
Near-Optimal Parallel Join Processing in MapReduce (posted 2013-03-12)
<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="separator" style="clear: both; text-align: center;">
<span id="goog_1911137159"></span><span id="goog_1911137160"></span><a href="http://www.blogger.com/"></a><br /></div>
<br />
<iframe width="653" height="370" src="http://www.youtube.com/embed/kiuUGXWRzPA" frameborder="0" allowfullscreen></iframe>
</div>
Computer Science (playlist) (posted 2013-02-19)
<div dir="ltr" style="text-align: left;" trbidi="on">
<iframe allowfullscreen="" frameborder="0" height="344" src="http://www.youtube.com/embed/videoseries?list=EC36E7A2B75028A3D6" width="425"></iframe></div>
V3, S4, CRUD, CRAP (posted 2013-01-13)
<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="separator" style="clear: both; text-align: center;">
<object class="BLOGGER-youtube-video" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0" data-thumbnail-src="http://3.gvt0.com/vi/8gMp0YC0_kM/0.jpg" height="266" width="320"><param name="movie" value="http://www.youtube.com/v/8gMp0YC0_kM&fs=1&source=uds" /><param name="bgcolor" value="#FFFFFF" /><param name="allowFullScreen" value="true" /><embed width="320" height="266" src="http://www.youtube.com/v/8gMp0YC0_kM&fs=1&source=uds" type="application/x-shockwave-flash" allowfullscreen="true"></embed></object></div>
<br /></div>
Extending Hive : Custom Mapper & Reducer (posted 2012-08-31)
<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="color: #0b5394; font-size: large;"><b>Hive : A data warehouse system on top of Hadoop for ad hoc queries.</b></span><br />
<br />
<ul style="text-align: left;">
<li><span style="font-size: large;"><span style="color: #0b5394;">A query is converted into map & reduce tasks that run in parallel on a number of nodes, so results come back quickly even from massive data. </span></span></li>
<li><span style="font-size: large;"><span style="color: #0b5394;">Hive also allows us to extend it with our own implementations of the map & reduce steps. These can be written as scripts in any programming language and then used as <b>hive-functions.</b></span></span></li>
</ul>
<ul style="text-align: left;">
<li><span style="font-size: large;"><b style="color: purple;">Extending Mapper : </b><span style="color: #0b5394;">By default Hive supports a few input formats in which every record is terminated by '\n' and fields are separated by ',' or '|' etc. When the data is not in a format Hive can understand, we can put the parsing logic in a custom mapper and emit key-value pairs. Hive lets us </span></span><span style="color: #0b5394; font-size: large;">extend the <b>GenericUDTF (</b>User Defined Table-generating Function<b>) </b>class for this purpose. </span></li>
</ul>
<ul style="text-align: left;">
<li><span style="color: #0b5394; font-size: large;"><b style="color: purple;">Extending Reducer : </b></span><span style="color: #0b5394;"><span style="font-size: large;">In the same way, we can derive further results in a custom reducer by extending </span><span style="font-size: large;"><b>GenericUDAF</b> (</span><span style="font-size: large;">User Defined Aggregate Function</span><span style="font-size: large;">)</span><b style="font-size: x-large;">. </b><span style="font-size: large;">Results can be filtered and passed to multiple/different tables.</span></span></li>
</ul>
<span style="color: #38761d; font-size: large;">These extensions are very useful when you need to analyse or query data stored in very different or unstructured formats. An extended mapper brings such data into a common key-value form, and an extended reducer can then split the map output or derive different datasets from it.</span><br />
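<div>
The mapper and reducer roles described above can be sketched in plain Java without any Hive classes. This is not the GenericUDTF or GenericUDAF API, only an illustration of the idea: the "mapper" parses a line in a hypothetical custom format such as <b>action=login;ms=120</b> into a key-value pair, and the "reducer" aggregates the values per key. The record format and all names are assumptions made for the example.</div>

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CustomMapReduceSketch {
    // "Mapper": parse one line of the hypothetical format "action=<name>;ms=<millis>"
    // and emit the pair (action, millis). Throws if either field is missing or malformed.
    static Map.Entry<String, Integer> map(String line) {
        Map<String, String> fields = new HashMap<>();
        for (String part : line.split(";")) {
            String[] kv = part.split("=", 2);
            if (kv.length == 2) {
                fields.put(kv[0].trim(), kv[1].trim());
            }
        }
        String action = fields.get("action");
        String millis = fields.get("ms");
        if (action == null || millis == null) {
            throw new IllegalArgumentException("unparseable line: " + line);
        }
        return new AbstractMap.SimpleEntry<>(action, Integer.parseInt(millis));
    }

    // "Reducer": sum the emitted values for each key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] lines = {"action=login;ms=120", "action=logout;ms=30", "action=login;ms=80"};
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.add(map(line));
        }
        System.out.println(reduce(pairs)); // prints {login=200, logout=30}
    }
}
```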
<div>
<span style="color: #0b5394; font-size: large;"><br /></span></div>
<div>
<div style="text-align: center;">
<span style="font-size: large;"><span style="color: #f6b26b;">Thank You</span></span></div>
<span style="color: #134f5c; font-size: large;"><br /></span></div>
</div>
What is Apache Sqoop? (posted 2012-08-31)
<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="color: purple; font-size: large;"><b>Apache Sqoop is a tool for bulk import/export of data between external databases and the Hadoop ecosystem (HDFS, HBase or Hive).</b></span><br />
<br />
<ul style="text-align: left;">
<li><span style="color: #0b5394; font-size: large;">works with a number of databases and commercial data warehouses.</span></li>
<li><span style="color: #0b5394;"><span style="font-size: large;">available as a command-line tool; it can also be invoked from Java </span><span style="font-size: large;">by passing the appropriate arguments.</span></span></li>
<li><span style="color: #0b5394;"><span style="font-size: large;">graduated from the Incubator & became a top-level project at the ASF.</span></span></li>
</ul>
<span style="color: purple; font-size: large;"><b>The current version of Sqoop runs a map-only job, with all the transformation happening in the map task. ( <a href="https://cwiki.apache.org/confluence/display/SQOOP/Sqoop+2" target="_blank">Sqoop2</a> will possibly have a reduce task as well)</b></span><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://www.cloudera.com/wp-content/uploads/2012/01/sqoop2archMeta3.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="304" src="https://www.cloudera.com/wp-content/uploads/2012/01/sqoop2archMeta3.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Fig : Sqoop 2 Architecture diagram taken from Cloudera.com</td></tr>
</tbody></table>
<div>
<span style="color: #0b5394; font-size: large;">Example : </span></div>
<div>
<b><span style="color: #134f5c;">alok@ubuntu:~/apache/sqoop-1.4.2</span>$ bin/sqoop import --connect jdbc:mysql://<<i>hostname</i>>:3306/<<i>dbname</i>> --username <<i>user</i>> -P --driver com.mysql.jdbc.Driver --table <<i>tablename</i>> --hbase-table <<i>hbase-tablename</i>> --column-family <<i>hbase-columnFamily</i>> --hbase-create-table</b></div>
<div>
<br /></div>
<div>
<span style="font-size: large;">or use it like this in your java programs - </span></div>
<div>
<b>ArrayList<String> args = new ArrayList<String>();</b></div>
<div>
<b>args.add("import"); // the tool name must come first, as on the command line</b></div>
<div>
<b>args.add("--connect");</b></div>
<div>
<b>args.add("jdbc:mysql://<<i>hostname</i>>:3306/<<i>dbname</i>>");</b></div>
<div>
<b>args.add("--username");</b></div>
<div>
<b>args.add("<<i>user</i>>");</b></div>
<div>
<b>args.add("--driver");</b></div>
<div>
<b>args.add("com.mysql.jdbc.Driver");</b></div>
<div>
<b>args.add("--table");</b></div>
<div>
<b>args.add("<<i>tablename</i>>");</b></div>
<div>
<b>args.add("--hbase-table");</b></div>
<div>
<b>args.add("<<i>hbase-tablename</i>>");</b></div>
<div>
<b>args.add("--column-family");</b></div>
<div>
<b>args.add("<<i>hbase-columnFamily</i>>");</b></div>
<div>
<b>args.add("--hbase-create-table");</b></div>
<div>
<b>args.add("--num-mappers");</b></div>
<div>
<b>args.add("2");</b></div>
<div>
<br /></div>
<div>
<span style="font-size: large;">int ret = Sqoop.</span><span style="font-size: large;">runTool(args.toArray(new String[args.size()]));</span></div>
<div>
<ul style="text-align: left;">
<li><span style="color: purple; font-size: large;"><b>Sqoop can write data directly to HDFS or HBase or Hive.</b></span></li>
<li><span style="color: purple; font-size: large;"><b>It can also export data back to RDBMS tables from Hadoop.</b></span></li>
<li><span style="color: purple; font-size: large;"><b>Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks.</b></span></li>
</ul>
</div>
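<div>
For the export direction, the argument list can be built the same way. The helper below is a sketch with no Sqoop dependency: it only assembles the arguments for the <b>export</b> tool, with the tool name first as on the command line. The class name, method name and sample values are hypothetical; the resulting array would be handed to Sqoop.runTool(...) with Sqoop on the classpath, and password handling (-P or --password) is left out on purpose.</div>

```java
import java.util.ArrayList;
import java.util.List;

class SqoopExportArgs {
    // Builds the argument array for "sqoop export": tool name first, then flag/value pairs.
    static String[] build(String jdbcUrl, String user, String table, String exportDir) {
        List<String> args = new ArrayList<>();
        args.add("export");
        args.add("--connect");
        args.add(jdbcUrl);
        args.add("--username");
        args.add(user);
        args.add("--table");
        args.add(table);
        args.add("--export-dir"); // HDFS directory holding the data to export
        args.add(exportDir);
        return args.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Placeholder values, invented for this example.
        String[] exportArgs = build("jdbc:mysql://dbhost:3306/sales", "etl_user",
                "daily_totals", "/user/etl/daily_totals");
        System.out.println(String.join(" ", exportArgs));
        // With Sqoop on the classpath: int ret = Sqoop.runTool(exportArgs);
    }
}
```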
<div>
<span style="color: purple; font-size: large;"><br /></span></div>
<div>
<span style="font-size: large;"><br /></span></div>
</div>