Bda U3
a. Parallel import/export
When it comes to importing and exporting data, Sqoop uses the
YARN framework, which provides fault tolerance on top of
parallelism.
b. Connectors for all major RDBMS Databases
Sqoop offers connectors for multiple RDBMS databases, covering
almost all of the commonly used ones.
c. Import results of SQL query
We can also import the result set returned by an SQL query into HDFS (see the example commands after this list).
d. Incremental Load
Sqoop offers an incremental load facility, so we can load just the
parts of a table that have changed since the last import.
e. Full Load
This is one of the important features of Sqoop: we can load a
whole table with a single command, and we can also load all the
tables from a database with a single command.
f. Kerberos Security Integration
Sqoop supports Kerberos authentication. Kerberos is a computer
network authentication protocol that works on the basis of ‘tickets’
to allow nodes communicating over a non-secure network to prove
their identity to one another in a secure manner.
g. Load data directly into HIVE/HBase
For analysis, we can load data directly into Apache Hive. We can
also dump the data into HBase, which is a NoSQL database.
h. Compression
We can compress the data with the deflate (gzip) algorithm by using
the --compress argument, or by specifying the --compression-codec
argument. We can also load a compressed table into Apache Hive.
i. Support for Accumulo
Rather than importing into a directory in HDFS, we can instruct
Sqoop to import the table into Accumulo.
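Below is a sketch of how some of these features look on the command line. The connection string, credentials, table names, and paths are placeholder assumptions, not values from this text.

# Import the result of an SQL query; the $CONDITIONS token lets Sqoop split the query across mappers
$ sqoop import \
    --connect jdbc:mysql://dbhost/shop \
    --username sqoopuser -P \
    --query 'SELECT id, name, amount FROM orders WHERE $CONDITIONS' \
    --split-by id \
    --target-dir /user/hadoop/orders \
    --compress

# Incremental load: fetch only rows whose check column is greater than the last recorded value
$ sqoop import \
    --connect jdbc:mysql://dbhost/shop \
    --username sqoopuser -P \
    --table orders \
    --incremental append --check-column id --last-value 10000

# Full load of every table in the database with a single command
$ sqoop import-all-tables --connect jdbc:mysql://dbhost/shop --username sqoopuser -P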
Sqoop Architecture and Working
Let us understand the Apache Sqoop architecture and how a Sqoop
import works internally with the help of the diagram below:
[Diagram: Sqoop Architecture and Working]
• When we submit a Sqoop command, the main task gets divided into subtasks, and each
subtask is handled internally by an individual map task.
• Each map task is a subtask that imports a part of the data into the Hadoop ecosystem;
taken together, all the map tasks import the whole data set.
• Export works in the same way. The Sqoop export tool exports a set of files from HDFS
back to an RDBMS. The files given as input to Sqoop contain records, which are called
rows in the table.
• When we submit our export job, it is mapped into map tasks, each of which brings a chunk of data
from HDFS and exports it to the structured data destination.
• By combining all these exported chunks of data, we receive the whole data set at the destination.
• A reduce phase is required only when aggregations are needed. Sqoop does not
perform any aggregations; it just imports and exports the data. The map job
launches multiple mappers based on the number defined by the user.
• For a Sqoop import, each mapper task is assigned a part of the
data to be imported. To get high performance, Sqoop distributes
the input data equally among the mappers. Each mapper then
creates a connection with the database using JDBC, fetches the
part of the data assigned by Sqoop, and writes it into HDFS,
Hive, or HBase based on the arguments provided on the CLI.
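As a sketch of how this parallelism is controlled on the command line (connection details and column names are placeholder assumptions), the -m/--num-mappers option sets how many map tasks run and --split-by names the column used to divide the rows among them:

$ sqoop import \
    --connect jdbc:mysql://dbhost/shop \
    --username sqoopuser -P \
    --table orders \
    --split-by order_id \
    -m 8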
Sqoop Import- Importing Data From RDBMS to HDFS
To import data into HDFS, we use the following Sqoop import syntax:
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)
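For instance, a minimal import might look like the following; the connection string, credentials, table name, and target directory are placeholder assumptions. The Hadoop generic arguments (if any) must come before the import arguments.

$ sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --username sqoopuser -P \
    --table customers \
    --target-dir /user/hadoop/customers \
    -m 4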
Argument Description
--map-column-java <mapping>    Override mapping from SQL to Java type for configured columns.
After the Sqoop import tool, the next tool is the one that exports a set of files from
HDFS back to an RDBMS; this is what we call the export tool in Apache Sqoop.
The export arguments can be entered in any order with respect to one another,
but the Hadoop generic arguments must precede any export arguments.
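Mirroring the import syntax above, the export tool is invoked as:

$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)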
Table 1. Common arguments
Argument Description
--connect <jdbc-uri>    Specify JDBC connect string
--connection-manager <class-name>    Specify connection manager class to use
Export control arguments:
Argument Description
--columns <col,col,col…>    Columns to export to table
--direct    Use direct export fast path
--export-dir <dir>    HDFS source path for the export
-m,--num-mappers <n>    Use n map tasks to export in parallel
--table <table-name>    Table to populate
--call <stored-proc-name>    Stored procedure to call
--update-key <col-name>    Anchor column to use for updates. Use a comma-separated list of columns if there is more than one column.
--update-mode <mode>    Specify how updates are performed when new rows are found with non-matching keys in the database. Legal values for mode include updateonly (default) and allowinsert.
--input-null-string <null-string>    The string to be interpreted as null for string columns
If we specify incorrect delimiters, Sqoop will fail to find enough columns per line,
which will cause the export map tasks to fail by throwing ParseExceptions.
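Putting several of these export arguments together, here is a sketch; the database, directory, delimiter, and column names are placeholder assumptions.

$ sqoop export \
    --connect jdbc:mysql://dbhost/shop \
    --username sqoopuser -P \
    --table order_summary \
    --export-dir /user/hadoop/order_summary \
    --input-fields-terminated-by ',' \
    --update-key order_id \
    --update-mode allowinsert \
    -m 4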
Table 6. Code generation arguments:
Argument Description
--bindir <dir>    Output directory for compiled objects
--class-name <name>    Sets the generated class name. This overrides --package-name. When combined with --jar-file, sets the input class.
--jar-file <file>    Disable code generation; use specified jar
--outdir <dir>    Output directory for generated code
--package-name <name>    Put auto-generated classes in this package
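For illustration, these code generation arguments might be combined as below; the paths, package, and class names are placeholder assumptions.

# Generate the record class into a chosen package and directory
$ sqoop codegen \
    --connect jdbc:mysql://dbhost/shop \
    --username sqoopuser -P \
    --table orders \
    --bindir /tmp/sqoop-classes \
    --outdir /tmp/sqoop-src \
    --package-name com.example.sqoop

# Reuse a previously generated jar instead of regenerating code
$ sqoop export \
    --connect jdbc:mysql://dbhost/shop \
    --username sqoopuser -P \
    --table orders \
    --export-dir /user/hadoop/orders \
    --jar-file /tmp/sqoop-classes/orders.jar \
    --class-name com.example.sqoop.orders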
Argument Description
--enclosed-by <char>    Sets a required field enclosing character
Hive arguments:
Argument Description
--hive-table <table-name>    Sets the table name to use when importing to Hive.
--hive-drop-import-delims    Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-delims-replacement    Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive.
--hive-partition-key    Name of the Hive field on which partitions are sharded.
--hive-partition-value <v>    String value that serves as the partition key for the data imported into Hive in this job.
--map-column-hive <map>    Override default mapping from SQL type to Hive type for configured columns.
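A sketch combining several of these Hive arguments; the connection details and names are placeholder assumptions, and --hive-import (the flag that turns the import into a Hive import) is a standard Sqoop option not shown in the table above.

$ sqoop import \
    --connect jdbc:mysql://dbhost/shop \
    --username sqoopuser -P \
    --table orders \
    --hive-import \
    --hive-table orders_hive \
    --hive-drop-import-delims \
    --hive-partition-key load_date \
    --hive-partition-value '2024-01-01'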
Table 9. Sqoop Import – HBase arguments
Argument Description
--column-family <family>    Sets the target column family for the import
Accumulo arguments:
Argument Description
--accumulo-table <table-name>    Specifies an Accumulo table to use as the target instead of HDFS
--accumulo-column-family <family>    Sets the target column family for the import
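For illustration, an import can be pointed at HBase or Accumulo instead of HDFS as sketched below; the table, column family, and row key names are placeholder assumptions, and an Accumulo import will additionally need Accumulo connection arguments not shown here.

# Import into an HBase table instead of an HDFS directory
$ sqoop import \
    --connect jdbc:mysql://dbhost/shop \
    --username sqoopuser -P \
    --table customers \
    --hbase-table customers \
    --column-family info \
    --hbase-row-key customer_id

# Or target an Accumulo table
$ sqoop import \
    --connect jdbc:mysql://dbhost/shop \
    --username sqoopuser -P \
    --table customers \
    --accumulo-table customers \
    --accumulo-column-family info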
Disadvantages of NoSQL Databases
1. Lack of standardization: There are many different types of NoSQL databases, each with its own unique strengths and
weaknesses. This lack of standardization can make it difficult to choose the right database for a specific application.
2. Lack of ACID compliance: NoSQL databases are not fully ACID-compliant, which means they do not guarantee the
atomicity, consistency, isolation, and durability of data. This can be a drawback for applications that require strong data
consistency guarantees.
3. Narrow focus: NoSQL databases have a narrow focus; they are designed mainly for storage and provide relatively little
functionality beyond it. Relational databases are a better choice than NoSQL for transaction management.
4. Open source: NoSQL databases are open source, and there is no reliable standard for NoSQL yet. In other words, two
database systems are likely to be unequal.
5. Lack of support for complex queries: NoSQL databases are not designed to handle complex queries, which means they are
not a good fit for applications that require complex data analysis or reporting.
6. Lack of maturity: NoSQL databases are relatively new and lack the maturity of traditional relational databases. This can
make them less reliable and less secure than traditional databases.
7. Management challenge: The purpose of big data tools is to make managing a large amount of data as simple as possible,
but it is not so easy. Data management in NoSQL is much more complex than in a relational database. NoSQL, in particular,
has a reputation for being challenging to install and even more hectic to manage on a daily basis.
8. GUI not available: GUI-mode tools to access the database are not readily available in the market.
9. Backup: Backup is a great weak point for some NoSQL databases like MongoDB. MongoDB has no approach for backing
up data in a consistent manner.
10. Large document size: Some database systems like MongoDB and CouchDB store data in JSON format. This means that
documents are quite large (big data, network bandwidth, speed), and having descriptive key names actually hurts, since they
increase the document size.
Overview of HBase
• HBase is a data model similar to Google’s Bigtable. It is an open-source, distributed
database developed by the Apache Software Foundation and written in Java. HBase is an essential part of
the Hadoop ecosystem and runs on top of HDFS (Hadoop Distributed File System). It can store
massive amounts of data, from terabytes to petabytes. It is column-oriented and horizontally
scalable.
• History of HBase
HBase Architecture
HBase architecture has 3 main components: HMaster,
Region Server, Zookeeper.
• All three components are described below:
1. HMaster –
The implementation of the Master Server in HBase is HMaster. It is the process that assigns regions to
Region Servers and handles DDL (create, delete table) operations. It monitors all Region Server instances present in
the cluster. In a distributed environment, the Master runs several background threads. HMaster has many
responsibilities, such as controlling load balancing, failover, etc.
2. Region Server –
HBase tables are divided horizontally by row key range into regions. Regions are the basic building blocks
of an HBase cluster; they hold the distributed portions of the tables and are composed of column families. A Region
Server runs on an HDFS DataNode present in the Hadoop cluster and is responsible for handling, managing, and
executing read and write operations on its set of regions. The default size of a region is 256 MB.
3. Zookeeper –
It acts as a coordinator in HBase. It provides services such as maintaining configuration information, naming,
distributed synchronization, and server failure notification. Clients communicate with Region Servers
via ZooKeeper.
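As a small illustration of the table and column-family model these components serve, here are a few commands from the HBase shell; the table name, column family, and values are placeholder assumptions.

$ hbase shell
hbase> create 'users', 'info'                     # table 'users' with one column family 'info'
hbase> put 'users', 'row1', 'info:name', 'Asha'   # write a cell under row key 'row1'
hbase> get 'users', 'row1'                        # read back the row
hbase> scan 'users'                               # scan the whole table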
• Advantages of HBase –
• Disadvantages of HBase –
2. No transaction support
• In CRUD operations, 'U' stands for update, which means making changes
to the records present in the SQL tables. So, we use the UPDATE command
to modify the data present in tables.
Syntax:
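The general form is shown below; the table name, columns, and condition are placeholders.

UPDATE table_name
SET column1 = value1, column2 = value2
WHERE condition;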