Friday, August 31, 2012

What is Apache Sqoop?

Apache Sqoop is a tool for bulk import and export of data between relational databases and the Hadoop ecosystem (HDFS, HBase, or Hive).

  • works with a number of databases and commercial data warehouses.
  • available as a command-line tool; it can also be called from Java by passing the appropriate arguments.
  • graduated from the Incubator and became a top-level project at the ASF.
The current version of Sqoop runs a map-only job, with all transformations happening in the map tasks. (Sqoop 2 will possibly have a reduce task as well.)
Fig: Sqoop 2 architecture diagram, taken from Cloudera.com
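To give a feel for how a map-only import can share work between map tasks, here is a minimal sketch that divides a numeric key range into one slice per mapper. Sqoop does split an import on a column's minimum and maximum values, but the class and method names below (SplitSketch, split) are made up for illustration and this is not Sqoop's actual splitter code.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a simplified range splitter, not Sqoop's implementation. */
class SplitSketch {

    /**
     * Divides the inclusive range [min, max] into at most numMappers
     * contiguous, non-overlapping inclusive ranges of near-equal size.
     * Assumes (max - min + 1) does not overflow a long.
     */
    static List<long[]> split(long min, long max, int numMappers) {
        if (numMappers < 1 || max < min) {
            throw new IllegalArgumentException("need numMappers >= 1 and max >= min");
        }
        List<long[]> ranges = new ArrayList<>();
        long total = max - min + 1;
        long base = total / numMappers;   // minimum keys per mapper
        long extra = total % numMappers;  // the first 'extra' mappers take one more key
        long lo = min;
        for (int i = 0; i < numMappers && lo <= max; i++) {
            long size = base + (i < extra ? 1 : 0);
            long hi = lo + size - 1;
            ranges.add(new long[] {lo, hi});
            lo = hi + 1;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Each printed range would correspond to the rows read by one map task.
        for (long[] r : split(1, 10, 2)) {
            System.out.println(r[0] + ".." + r[1]);
        }
    }
}
```

Each range stands for the slice of rows a single map task would read, which is why no reduce step is needed to complete the import.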
Example: importing a MySQL table into HBase from the command line
alok@ubuntu:~/apache/sqoop-1.4.2$ bin/sqoop import --connect jdbc:mysql://<hostname>:3306/<dbname> --username <user> -P --driver com.mysql.jdbc.Driver --table <tablename> --hbase-table <hbase-tablename> --column-family <hbase-columnFamily> --hbase-create-table

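Or call it from your Java programs like this:
import java.util.ArrayList;
import org.apache.sqoop.Sqoop;

ArrayList<String> args = new ArrayList<String>();
// The first argument must be the name of the Sqoop tool to run.
args.add("import");
args.add("--connect");
args.add("jdbc:mysql://<hostname>:3306/<dbname>");
args.add("--username");
args.add("<user>");
// If the database requires a password, supply it as well (the command above uses -P).
args.add("--driver");
args.add("com.mysql.jdbc.Driver");
args.add("--table");
args.add("<tablename>");
args.add("--hbase-table");
args.add("<hbase-tablename>");
args.add("--column-family");
args.add("<hbase-colFamilyName>");
args.add("--hbase-create-table");
args.add("--num-mappers");
args.add("2");

// runTool returns 0 on success and a non-zero value on failure.
int ret = Sqoop.runTool(args.toArray(new String[args.size()]));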
  • Sqoop can write data directly to HDFS, HBase, or Hive.
  • It can also export data back to RDBMS tables from Hadoop.
  • Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks.
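The export direction mentioned above can be driven from Java in the same way. The following is a minimal sketch that only builds the argument array for the export tool, so it runs without Sqoop on the classpath. The class and method names (SqoopExportArgs, build) are hypothetical, and the connection values are placeholders to replace with your own.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: assembles command-line style arguments for "sqoop export". */
class SqoopExportArgs {

    /**
     * Builds the argument array for exporting the files under exportDir (an HDFS path)
     * into an existing RDBMS table. The password is left out on purpose and
     * has to be supplied separately if the database requires one.
     */
    static String[] build(String jdbcUrl, String user, String table, String exportDir) {
        List<String> args = new ArrayList<>();
        args.add("export");        // tool name comes first
        args.add("--connect");
        args.add(jdbcUrl);
        args.add("--username");
        args.add(user);
        args.add("--table");       // target table in the database
        args.add(table);
        args.add("--export-dir");  // source directory in HDFS
        args.add(exportDir);
        return args.toArray(new String[0]);
    }

    public static void main(String[] unused) {
        String[] args = build("jdbc:mysql://<hostname>:3306/<dbname>",
                "<user>", "<tablename>", "<hdfs-export-dir>");
        System.out.println(String.join(" ", args));
    }
}
```

With Sqoop on the classpath, the resulting array can be passed to Sqoop.runTool in the same way as the import arguments.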

