Bda U3
a. Parallel import/export
When it comes to importing and exporting data, Sqoop uses the
YARN framework, which provides fault tolerance on top of
parallelism.
b. Connectors for all major RDBMS Databases
Sqoop offers connectors for multiple RDBMS databases, covering
almost all of the commonly used ones.
c. Import results of SQL query
We can also import the result set returned by an SQL query into HDFS (see the example commands after this list).
d. Incremental Load
Sqoop offers an incremental load facility, so we can load just the
parts of a table that have changed since the last import.
e. Full Load
This is one of the important features of Sqoop: we can load a
whole table with a single command, and we can also load all the
tables from a database with a single command.
f. Kerberos Security Integration
Sqoop supports Kerberos authentication. Kerberos is a computer
network authentication protocol that works on the basis of ‘tickets’
to allow nodes communicating over a non-secure network to prove
their identity to one another in a secure manner.
g. Load data directly into HIVE/HBase
For analysis, we can load data directly into Apache Hive. We can
also dump the data into HBase, which is a NoSQL database.
h. Compression
We can compress the data with the deflate (gzip) algorithm by using
the --compress argument, or by specifying the --compression-codec
argument. We can also load a compressed table into Apache Hive.
i. Support for Accumulo
Rather than importing into a directory in HDFS, we can instruct
Sqoop to import the table into Accumulo.
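Below is a sketch of how some of these features look on the command line. The connection string, credentials, table names, and paths are placeholder assumptions, not values from this text.

# Import the result of an SQL query; the $CONDITIONS token lets Sqoop split the query across mappers
$ sqoop import \
    --connect jdbc:mysql://dbhost/shop \
    --username sqoopuser -P \
    --query 'SELECT id, name, amount FROM orders WHERE $CONDITIONS' \
    --split-by id \
    --target-dir /user/hadoop/orders \
    --compress

# Incremental load: fetch only rows whose check column is greater than the last recorded value
$ sqoop import \
    --connect jdbc:mysql://dbhost/shop \
    --username sqoopuser -P \
    --table orders \
    --incremental append --check-column id --last-value 10000

# Full load of every table in the database with a single command
$ sqoop import-all-tables --connect jdbc:mysql://dbhost/shop --username sqoopuser -P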
Sqoop Architecture and Working
Let us understand the Apache Sqoop architecture and how a Sqoop
import works internally with the help of the diagram below:
[Diagram: Sqoop Architecture and Working]
• When we submit a Sqoop command, the main task gets divided into subtasks, and each
subtask is handled internally by an individual map task.
• Each map task is a subtask that imports a part of the data into the Hadoop ecosystem;
taken together, all the map tasks import the whole data set.
• Export works in the same way. The Sqoop export tool exports a set of files from HDFS
back to an RDBMS. The files given as input to Sqoop contain records, which are called
rows in the table.
• When we submit our export job, it is mapped into map tasks, each of which brings a chunk of data
from HDFS and exports it to the structured data destination.
• By combining all these exported chunks of data, we receive the whole data set at the destination.
• A reduce phase is required only when aggregations are needed. Sqoop does not
perform any aggregations; it just imports and exports the data. The map job
launches multiple mappers based on the number defined by the user.
• For a Sqoop import, each mapper task is assigned a part of the
data to be imported. To get high performance, Sqoop distributes
the input data equally among the mappers. Each mapper then
creates a connection with the database using JDBC, fetches the
part of the data assigned by Sqoop, and writes it into HDFS,
Hive, or HBase based on the arguments provided on the CLI.
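As a sketch of how this parallelism is controlled on the command line (connection details and column names are placeholder assumptions), the -m/--num-mappers option sets how many map tasks run and --split-by names the column used to divide the rows among them:

$ sqoop import \
    --connect jdbc:mysql://dbhost/shop \
    --username sqoopuser -P \
    --table orders \
    --split-by order_id \
    -m 8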
Sqoop Import- Importing Data From RDBMS to HDFS
To import data into HDFS, we use the following Sqoop import syntax:
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)
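For instance, a minimal import might look like the following; the connection string, credentials, table name, and target directory are placeholder assumptions. The Hadoop generic arguments (if any) must come before the import arguments.

$ sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --username sqoopuser -P \
    --table customers \
    --target-dir /user/hadoop/customers \
    -m 4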
Argument Description
--map-column-java <mapping>    Override mapping from SQL to Java type for configured columns.
After the Sqoop import tool, the next tool is the one that exports a set of files from
HDFS back to an RDBMS; this is what we call the export tool in Apache Sqoop.
The export arguments can be entered in any order with respect to one another,
but the Hadoop generic arguments must precede any export arguments.
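Mirroring the import syntax above, the export tool is invoked as:

$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)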
Table 1. Common arguments
Argument Description
--connect <jdbc-uri>    Specify JDBC connect string
--connection-manager <class-name>    Specify connection manager class to use
Export control arguments:
Argument Description
--columns <col,col,col…>    Columns to export to table
--direct    Use direct export fast path
--export-dir <dir>    HDFS source path for the export
-m,--num-mappers <n>    Use n map tasks to export in parallel
--table <table-name>    Table to populate
--call <stored-proc-name>    Stored procedure to call
--update-key <col-name>    Anchor column to use for updates. Use a comma-separated list of columns if there is more than one column.
--update-mode <mode>    Specify how updates are performed when new rows are found with non-matching keys in the database. Legal values for mode include updateonly (default) and allowinsert.
--input-null-string <null-string>    The string to be interpreted as null for string columns
If we specify incorrect delimiters, Sqoop will fail to find enough columns per line,
which will cause the export map tasks to fail by throwing ParseExceptions.
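Putting several of these export arguments together, here is a sketch; the database, directory, delimiter, and column names are placeholder assumptions.

$ sqoop export \
    --connect jdbc:mysql://dbhost/shop \
    --username sqoopuser -P \
    --table order_summary \
    --export-dir /user/hadoop/order_summary \
    --input-fields-terminated-by ',' \
    --update-key order_id \
    --update-mode allowinsert \
    -m 4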
Table 6. Code generation arguments:
Argument Description
--bindir <dir>    Output directory for compiled objects
--class-name <name>    Sets the generated class name. This overrides --package-name. When combined with --jar-file, sets the input class.
--jar-file <file>    Disable code generation; use specified jar
--outdir <dir>    Output directory for generated code
--package-name <name>    Put auto-generated classes in this package
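For illustration, these code generation arguments might be combined as below; the paths, package, and class names are placeholder assumptions.

# Generate the record class into a chosen package and directory
$ sqoop codegen \
    --connect jdbc:mysql://dbhost/shop \
    --username sqoopuser -P \
    --table orders \
    --bindir /tmp/sqoop-classes \
    --outdir /tmp/sqoop-src \
    --package-name com.example.sqoop

# Reuse a previously generated jar instead of regenerating code
$ sqoop export \
    --connect jdbc:mysql://dbhost/shop \
    --username sqoopuser -P \
    --table orders \
    --export-dir /user/hadoop/orders \
    --jar-file /tmp/sqoop-classes/orders.jar \
    --class-name com.example.sqoop.orders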
Argument Description
--enclosed-by <char>    Sets a required field enclosing character
Hive arguments:
Argument Description
--hive-table <table-name>    Sets the table name to use when importing to Hive.
--hive-drop-import-delims    Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-delims-replacement    Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive.
--hive-partition-key    Name of the Hive field on which partitions are sharded.
--hive-partition-value <v>    String value that serves as the partition key for the data imported into Hive in this job.
--map-column-hive <map>    Override default mapping from SQL type to Hive type for configured columns.
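A sketch combining several of these Hive arguments; the connection details and names are placeholder assumptions, and --hive-import (the flag that turns the import into a Hive import) is a standard Sqoop option not shown in the table above.

$ sqoop import \
    --connect jdbc:mysql://dbhost/shop \
    --username sqoopuser -P \
    --table orders \
    --hive-import \
    --hive-table orders_hive \
    --hive-drop-import-delims \
    --hive-partition-key load_date \
    --hive-partition-value '2024-01-01'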
Table 9. Sqoop Import – HBase arguments
Argument Description
--column-family <family>    Sets the target column family for the import
Accumulo arguments:
Argument Description
--accumulo-table <table-name>    Specifies an Accumulo table to use as the target instead of HDFS
--accumulo-column-family <family>    Sets the target column family for the import
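For illustration, an import can be pointed at HBase or Accumulo instead of HDFS as sketched below; the table, column family, and row key names are placeholder assumptions, and an Accumulo import will additionally need Accumulo connection arguments not shown here.

# Import into an HBase table instead of an HDFS directory
$ sqoop import \
    --connect jdbc:mysql://dbhost/shop \
    --username sqoopuser -P \
    --table customers \
    --hbase-table customers \
    --column-family info \
    --hbase-row-key customer_id

# Or target an Accumulo table
$ sqoop import \
    --connect jdbc:mysql://dbhost/shop \
    --username sqoopuser -P \
    --table customers \
    --accumulo-table customers \
    --accumulo-column-family info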
Disadvantages of NoSQL Databases
1. Lack of standardization: There are many different types of NoSQL databases, each with its own unique strengths and
weaknesses. This lack of standardization can make it difficult to choose the right database for a specific application.
2. Lack of ACID compliance: NoSQL databases are not fully ACID-compliant, which means they do not guarantee the
atomicity, consistency, isolation, and durability of data. This can be a drawback for applications that require strong data
consistency guarantees.
3. Narrow focus: NoSQL databases have a narrow focus; they are designed mainly for storage and provide relatively little
functionality beyond it. Relational databases are a better choice than NoSQL for transaction management.
4. Open source: NoSQL databases are open source, and there is no reliable standard for NoSQL yet. In other words, two
database systems are likely to be unequal.
5. Lack of support for complex queries: NoSQL databases are not designed to handle complex queries, which means they are
not a good fit for applications that require complex data analysis or reporting.
6. Lack of maturity: NoSQL databases are relatively new and lack the maturity of traditional relational databases. This can
make them less reliable and less secure than traditional databases.
7. Management challenge: The purpose of big data tools is to make managing a large amount of data as simple as possible,
but it is not so easy. Data management in NoSQL is much more complex than in a relational database. NoSQL, in particular,
has a reputation for being challenging to install and even more hectic to manage on a daily basis.
8. GUI not available: GUI-mode tools to access the database are not readily available in the market.
9. Backup: Backup is a great weak point for some NoSQL databases like MongoDB. MongoDB has no approach for backing
up data in a consistent manner.
10. Large document size: Some database systems like MongoDB and CouchDB store data in JSON format. This means that
documents are quite large (big data, network bandwidth, speed), and having descriptive key names actually hurts, since they
increase the document size.
Overview of HBase
• HBase is a data model similar to Google’s Bigtable. It is an open-source, distributed
database developed by the Apache Software Foundation and written in Java. HBase is an essential part of
the Hadoop ecosystem and runs on top of HDFS (Hadoop Distributed File System). It can store
massive amounts of data, from terabytes to petabytes. It is column-oriented and horizontally
scalable.
• History of HBase
HBase Architecture
HBase architecture has 3 main components: HMaster,
Region Server, Zookeeper.
• All three components are described below:
1. HMaster –
The implementation of the Master Server in HBase is HMaster. It is the process that assigns regions to
Region Servers and handles DDL (create, delete table) operations. It monitors all Region Server instances present in
the cluster. In a distributed environment, the Master runs several background threads. HMaster has many
responsibilities, such as controlling load balancing, failover, etc.
2. Region Server –
HBase tables are divided horizontally by row key range into regions. Regions are the basic building blocks
of an HBase cluster; they hold the distributed portions of the tables and are composed of column families. A Region
Server runs on an HDFS DataNode present in the Hadoop cluster and is responsible for handling, managing, and
executing read and write operations on its set of regions. The default size of a region is 256 MB.
3. Zookeeper –
It acts as a coordinator in HBase. It provides services such as maintaining configuration information, naming,
distributed synchronization, and server failure notification. Clients communicate with Region Servers
via ZooKeeper.
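As a small illustration of the table and column-family model these components serve, here are a few commands from the HBase shell; the table name, column family, and values are placeholder assumptions.

$ hbase shell
hbase> create 'users', 'info'                     # table 'users' with one column family 'info'
hbase> put 'users', 'row1', 'info:name', 'Asha'   # write a cell under row key 'row1'
hbase> get 'users', 'row1'                        # read back the row
hbase> scan 'users'                               # scan the whole table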
• Advantages of HBase –
• Disadvantages of HBase –
2. No transaction support
• In CRUD operations, 'U' stands for update, which means making changes
to the records present in the SQL tables. So, we use the UPDATE command
to modify the data present in tables.
Syntax:
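The general form is shown below; the table name, columns, and condition are placeholders.

UPDATE table_name
SET column1 = value1, column2 = value2
WHERE condition;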