
UNIT-III

 SQOOP: Introduction to SQOOP


 SQOOP imports: From Database to HDFS/Hive
 SQOOP exports: From HDFS/Hive to Database
 Incremental imports
 NoSQL &HBase: Overview
 HBase architecture
 CRUD operations
Sqoop is the Big Data tool we use for transferring data between Hadoop and
relational database servers.

• What is Apache Sqoop?


• Apache Sqoop is an open-source data integration tool designed to make
it easier to move data between Apache Hadoop and conventional
relational databases or other structured data stores.
• It addresses the difficulty of efficiently ingesting data from external
systems into Hadoop's distributed file system (HDFS) and exporting
processed or analyzed data back to relational databases for use in
business intelligence or reporting tools.
• One of Sqoop's core functions is importing data from relational
databases such as MySQL, Oracle, SQL Server, and PostgreSQL
into HDFS.
• It enables incremental imports, allowing users to import only the new
or changed records since the last import, which minimises data transfer
time and helps keep the two systems consistent. Parallel imports are
also supported, enabling the efficient transfer of large datasets.
• When it comes to exporting, Sqoop can send processed or analysed data
from HDFS back to relational databases, so that the insights obtained
from big data analysis can be incorporated into existing data
warehousing systems without difficulty.
• Basically, Sqoop ("SQL-to-Hadoop") is a straightforward
command-line tool. It offers the following capabilities:

1. Import individual tables or entire databases to files in HDFS
2. Generate Java classes that allow you to interact with your imported data
3. Import from SQL databases straight into your Hive data warehouse
• Key Features of Sqoop
Sqoop has several salient features, which give us many reasons to learn it:

a. Parallel import/export
For both import and export, Sqoop uses the YARN framework, which
provides fault tolerance on top of parallelism.
b. Connectors for all major RDBMS databases
Sqoop offers connectors for almost all of the major relational databases.
c. Import results of an SQL query
We can also import the result set returned by an SQL query into HDFS.
d. Incremental load
Sqoop offers incremental load, so we can import just the parts of a
table that have changed since the last import.
e. Full load
With a single command we can load a whole table, and with a single
command we can also load all the tables of a database.
f. Kerberos security integration
Sqoop supports Kerberos authentication. Kerberos is a computer network
authentication protocol that works on the basis of 'tickets' to allow nodes
communicating over a non-secure network to prove their identity to one
another in a secure manner.
g. Load data directly into Hive/HBase
For analysis, we can load data directly into Apache Hive, or dump it
into HBase, which is a NoSQL database.
h. Compression
We can compress the imported data with the deflate (gzip) algorithm
using the --compress argument, or choose another codec with the
--compression-codec argument; compressed tables can also be loaded
into Apache Hive (see the sketch after this list).
i. Support for Accumulo
We can also instruct Sqoop to import a table into Accumulo rather than
into a directory in HDFS.
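As an illustration of feature (h), a minimal hedged sketch of a compressed
import is shown below; the host, database, and table names are hypothetical,
and the Snappy codec class is only one possible choice:

$ sqoop import --connect jdbc:mysql://db.example.com/corp --table EMPLOYEES \
    --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec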
Sqoop Architecture and Working

Let's understand the Apache Sqoop architecture and how Sqoop import
and export work internally.
• The Sqoop import tool imports individual tables from an RDBMS into HDFS. In HDFS, each row of a
table is treated as a record.

• When we submit a Sqoop command, the main task is divided into subtasks, each of which is handled
internally by an individual map task.
• A map task is the subtask that imports a part of the data into the Hadoop ecosystem; collectively, all the
map tasks import the whole dataset. Export works in the same way.
• The Sqoop export tool exports a set of files from HDFS back to an RDBMS. The files given to Sqoop as
input contain records, which become the rows of the target table.

• When we submit an export job, it is mapped into map tasks, each of which brings a chunk of data
from HDFS and exports it to the structured data destination. Combining all these exported chunks yields
the whole dataset at the destination.
• A reduce phase is required only for aggregations, and Sqoop does not
perform any aggregations; it just imports and exports the data. The
map-only job launches as many mappers as the number defined by the
user.
• For a Sqoop import, each mapper task is assigned a part of the data
to be imported. To achieve high performance, Sqoop distributes the
input data equally among the mappers. Each mapper then creates a
connection with the database using JDBC, fetches the part of the data
assigned to it by Sqoop, and writes it into HDFS, Hive, or HBase on
the basis of the arguments provided on the CLI.
Sqoop Import – Importing Data From RDBMS to HDFS
To import data into HDFS, we use the following syntax in Sqoop:
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)

The import arguments can be entered in any order with respect to
one another; however, the Hadoop generic arguments must precede any
import arguments.
All the arguments are grouped into collections organized by function,
and some collections (for example, the "common" arguments) are present
in several tools.
Table 1. Sqoop Import – Common arguments

Argument                              Description
--connect <jdbc-uri>                  Specify JDBC connect string
--connection-manager <class-name>     Specify connection manager class to use
--driver <class-name>                 Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>            Override $HADOOP_MAPRED_HOME
--help                                Print usage instructions
--password-file                       Set path for a file containing the authentication password
-P                                    Read password from console
--password <password>                 Set authentication password
--username <username>                 Set authentication username
--verbose                             Print more information while working
--connection-param-file <filename>    Optional properties file that provides connection parameters
--relaxed-isolation                   Set connection transaction isolation to read uncommitted for the mappers
• a. Connecting to a Database Server
Sqoop is designed to import tables from a database into HDFS. To do
so, you must specify a connect string that describes how to connect to
the database. The connect string is similar to a URL and is passed to
Sqoop with the --connect argument. It defines the server and database to
connect to, and may also specify the port.
• For example:

$ sqoop import --connect jdbc:mysql://database.example.com/employees
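As a hedged variant (the host, port, and user name below are placeholders
rather than values from the slides), the connect string can also carry an
explicit port, with the credentials supplied separately:

$ sqoop import --connect jdbc:mysql://database.example.com:3306/employees \
    --username someuser -P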


Table 2. Sqoop Import – Validation arguments

Argument                                   Description
--validate                                 Enable validation of data copied; supports single-table copy only
--validator <class-name>                   Specify validator class to use
--validation-threshold <class-name>        Specify validation threshold class to use
--validation-failurehandler <class-name>   Specify validation failure handler class to use


Table 3. Sqoop Import – Import control arguments

Argument                            Description
--append                            Append data to an existing dataset in HDFS
--as-avrodatafile                   Imports data to Avro data files
--as-sequencefile                   Imports data to SequenceFiles
--as-textfile                       Imports data as plain text (default)
--as-parquetfile                    Imports data to Parquet files
--boundary-query <statement>        Boundary query to use for creating splits
--columns <col,col,col…>            Columns to import from table
--delete-target-dir                 Delete the import target directory if it exists
--direct                            Use direct connector if it exists for the database
--fetch-size <n>                    Number of entries to read from the database at once
--inline-lob-limit <n>              Set the maximum size for an inline LOB
-m,--num-mappers <n>                Use n map tasks to import in parallel
-e,--query <statement>              Import the results of statement
--split-by <column-name>            Column of the table used to split work units; cannot be used with the --autoreset-to-one-mapper option
--autoreset-to-one-mapper           Import should use one mapper if a table has no primary key and no split-by column is provided; cannot be used with the --split-by <col> option
--table <table-name>                Table to read
--target-dir <dir>                  HDFS destination dir
--warehouse-dir <dir>               HDFS parent for table destination
--where <where clause>              WHERE clause to use during import
-z,--compress                       Enable compression
--compression-codec <c>             Use Hadoop codec (default gzip)
--null-string <null-string>         The string to be written for a null value for string columns
--null-non-string <null-string>     The string to be written for a null value for non-string columns

Both the --null-string and --null-non-string arguments are optional; if they
are not specified, the string "null" is used.
Table 4. Parameters for overriding mapping

Argument                        Description
--map-column-java <mapping>     Override mapping from SQL to Java type for configured columns
--map-column-hive <mapping>     Override mapping from SQL to Hive type for configured columns

Sqoop expects a comma-separated list of mappings in the form
<name of column>=<new type>.
For example:
$ sqoop import ... --map-column-java id=String,value=Integer
Sqoop will raise an exception if a configured mapping is not used.
Sqoop Import Examples
The following examples show how to use the import tool in a variety of
situations.

A basic import of a table named EMPLOYEES in the corp database:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES

A basic import requiring a login:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --username SomeUser -P
Enter password: (hidden)

• Selecting specific columns from the EMPLOYEES table:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --columns "employee_id,first_name,last_name,job_title"

• Controlling the import parallelism (using 8 parallel tasks):

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    -m 8

• Storing data in SequenceFiles, and setting the generated class name to
com.foocorp.Employee:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --class-name com.foocorp.Employee --as-sequencefile

• Specifying the delimiters to use in a text-mode import:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --fields-terminated-by '\t' --lines-terminated-by '\n' \
    --optionally-enclosed-by '\"'

• Importing the data into Hive:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hive-import

• Importing only new employees:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --where "start_date > '2010-01-01'"

• Changing the splitting column from the default:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --split-by dept_id
Sqoop Export – Exporting From HDFS to RDBMS

The counterpart of the import tool is the export tool, which exports a set of
files from HDFS back to an RDBMS.

Sqoop Export Syntax


$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)

The export arguments can be entered in any order with respect to one
another, but the Hadoop generic arguments must precede any export
arguments.
Table 1. Common arguments

Argument                              Description
--connect <jdbc-uri>                  Specify JDBC connect string
--connection-manager <class-name>     Specify connection manager class to use
--driver <class-name>                 Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>            Override $HADOOP_MAPRED_HOME
--help                                Print usage instructions
--password-file                       Set path for a file containing the authentication password
-P                                    Read password from console
--password <password>                 Set authentication password
--username <username>                 Set authentication username
--verbose                             Print more information while working
--connection-param-file <filename>    Optional properties file that provides connection parameters
Table 2. Validation arguments

Argument                                   Description
--validate                                 Enable validation of data copied; supports single-table copy only
--validator <class-name>                   Specify validator class to use
--validation-threshold <class-name>        Specify validation threshold class to use
--validation-failurehandler <class-name>   Specify validation failure handler class to use


Table 3. Export control arguments

Argument                                  Description
--columns <col,col,col…>                  Columns to export to table
--direct                                  Use direct export fast path
--export-dir <dir>                        HDFS source path for the export
-m,--num-mappers <n>                      Use n map tasks to export in parallel
--table <table-name>                      Table to populate
--call <stored-proc-name>                 Stored procedure to call
--update-key <col-name>                   Anchor column to use for updates; use a comma-separated list of columns if there is more than one column
--update-mode <mode>                      Specify how updates are performed when new rows are found with non-matching keys in the database; legal values for mode include updateonly (default) and allowinsert
--input-null-string <null-string>         The string to be interpreted as null for string columns
--input-null-non-string <null-string>     The string to be interpreted as null for non-string columns
--staging-table <staging-table-name>      The table in which data will be staged before being inserted into the destination table
--clear-staging-table                     Indicates that any data present in the staging table can be deleted
--batch                                   Use batch mode for underlying statement execution
Table 4. Input parsing arguments

Argument                                  Description
--input-enclosed-by <char>                Sets a required field encloser
--input-escaped-by <char>                 Sets the input escape character
--input-fields-terminated-by <char>       Sets the input field separator
--input-lines-terminated-by <char>        Sets the input end-of-line character
--input-optionally-enclosed-by <char>     Sets a field enclosing character


Table 5. Output line formatting arguments

Argument                             Description
--enclosed-by <char>                 Sets a required field enclosing character
--escaped-by <char>                  Sets the escape character
--fields-terminated-by <char>        Sets the field separator character
--lines-terminated-by <char>         Sets the end-of-line character
--mysql-delimiters                   Uses MySQL's default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: '
--optionally-enclosed-by <char>      Sets a field enclosing character

If we specify incorrect delimiters, Sqoop will fail to find enough columns
per line, which will cause the export map tasks to fail by throwing
ParseExceptions.
Table 6. Code generation arguments

Argument                   Description
--bindir <dir>             Output directory for compiled objects
--class-name <name>        Sets the generated class name; this overrides --package-name. When combined with --jar-file, sets the input class
--jar-file <file>          Disable code generation; use specified jar
--outdir <dir>             Output directory for generated code
--package-name <name>      Put auto-generated classes in this package
--map-column-java <m>      Override default mapping from SQL type to Java type for configured columns
Sqoop Export Examples

• A basic export to populate a table named bar:

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data

• Another basic export, populating the table bar with validation enabled:

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data --validate

• An export that calls a stored procedure named barproc for every record in
/results/bar_data:

$ sqoop export --connect jdbc:mysql://db.example.com/foo --call barproc \
    --export-dir /results/bar_data
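To illustrate the --update-key and --update-mode arguments from Table 3,
the following is a hedged sketch of an upsert-style export; the key column
id is an assumption, not taken from the original examples:

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data \
    --update-key id --update-mode allowinsert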
Incremental Imports
• Sqoop offers an incremental import mode that can be used to retrieve
only rows newer than some previously imported set of rows.
The following arguments control incremental imports in Sqoop:
Table 5. Sqoop Import – Incremental import arguments

Argument                  Description
--check-column (col)      Specifies the column to be examined when determining which rows to import (the column should not be of type CHAR/NCHAR/VARCHAR/VARNCHAR/LONGVARCHAR/LONGNVARCHAR)
--incremental (mode)      Specifies how Sqoop determines which rows are new; legal values for mode include append and lastmodified
--last-value (value)      Specifies the maximum value of the check column from the previous import
There are two types of incremental imports in Sqoop:
append and lastmodified.
The type of incremental import to perform is selected with the
--incremental argument, as illustrated in the sketch below.
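The following hedged sketch shows both modes; the check columns and last
values are hypothetical placeholders:

# Append mode: import rows whose id is greater than the last imported value
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --incremental append --check-column id --last-value 1000

# Lastmodified mode: import rows modified after the given timestamp
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --incremental lastmodified --check-column last_update \
    --last-value "2010-01-01 00:00:00"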
Table 6. Sqoop Import – Output line formatting arguments

Argument                             Description
--enclosed-by <char>                 Sets a required field enclosing character
--escaped-by <char>                  Sets the escape character
--fields-terminated-by <char>        Sets the field separator character
--lines-terminated-by <char>         Sets the end-of-line character
--mysql-delimiters                   Uses MySQL's default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: '
--optionally-enclosed-by <char>      Sets a field enclosing character


Table 7. Sqoop Import – Input parsing arguments

Argument                                  Description
--input-enclosed-by <char>                Sets a required field encloser
--input-escaped-by <char>                 Sets the input escape character
--input-fields-terminated-by <char>       Sets the input field separator
--input-lines-terminated-by <char>        Sets the input end-of-line character
--input-optionally-enclosed-by <char>     Sets a field enclosing character


Table 8. Sqoop Import – Hive arguments

Argument                        Description
--hive-home <dir>               Override $HIVE_HOME
--hive-import                   Import tables into Hive (uses Hive's default delimiters if none are set)
--hive-overwrite                Overwrite existing data in the Hive table
--create-hive-table             If set, the job will fail if the target Hive table exists; by default this property is false
--hive-table <table-name>       Sets the table name to use when importing into Hive
--hive-drop-import-delims       Drops \n, \r, and \01 from string fields when importing to Hive
--hive-delims-replacement       Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive
--hive-partition-key            Name of the Hive field on which the partitions are sharded
--hive-partition-value <v>      String value that serves as the partition key for data imported into Hive in this job
--map-column-hive <map>         Override default mapping from SQL type to Hive type for configured columns
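Combining a few of these arguments, a hedged sketch of a Hive import might
look like the following; the Hive table name employees_hive is an assumption:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hive-import --hive-table employees_hive --hive-overwrite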
Table 9. Sqoop Import – HBase arguments

Argument                      Description
--column-family <family>      Sets the target column family for the import
--hbase-create-table          If specified, create missing HBase tables
--hbase-row-key <col>         Specifies which input column to use as the row key; if the input table contains a composite key, <col> must be a comma-separated list of composite key attributes
--hbase-table <table-name>    Specifies an HBase table to use as the target instead of HDFS
--hbase-bulkload              Enables bulk loading
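As a hedged illustration of these arguments (the HBase table, column family,
and row-key column below are assumed names), an import into HBase could
look like:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hbase-table employees --column-family info \
    --hbase-row-key employee_id --hbase-create-table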


Table 10. Sqoop Import – Accumulo arguments

Argument                               Description
--accumulo-table <table-name>          Specifies an Accumulo table to use as the target instead of HDFS
--accumulo-column-family <family>      Sets the target column family for the import
--accumulo-create-table                If specified, create missing Accumulo tables
--accumulo-row-key <col>               Specifies which input column to use as the row key
--accumulo-visibility <vis>            (Optional) Specifies a visibility token to apply to all rows inserted into Accumulo; default is the empty string
--accumulo-batch-size <size>           (Optional) Sets the size in bytes of Accumulo's write buffer; default is 4 MB
--accumulo-max-latency <ms>            (Optional) Sets the max latency in milliseconds for the Accumulo batch writer; default is 0
--accumulo-zookeepers <host:port>      Comma-separated list of Zookeeper servers used by the Accumulo instance
--accumulo-instance <instance-name>    Name of the target Accumulo instance
--accumulo-user <username>             Name of the Accumulo user to import as
--accumulo-password <password>         Password for the Accumulo user


Table 11. Sqoop Import – Code generation arguments

Argument                   Description
--bindir <dir>             Output directory for compiled objects
--class-name <name>        Sets the generated class name; this overrides --package-name. When combined with --jar-file, sets the input class
--jar-file <file>          Disable code generation; use specified jar
--outdir <dir>             Output directory for generated code
--package-name <name>      Put auto-generated classes in this package
--map-column-java <m>      Override default mapping from SQL type to Java type for configured columns
Table 12. Sqoop Import – Additional import configuration properties

Property                          Description
sqoop.bigdecimal.format.string    Controls how BigDecimal columns are formatted when stored as a String. A value of true (default) uses toPlainString to store them without an exponent component (0.0000001), while a value of false uses toString, which may include an exponent (1E-7)
sqoop.hbase.add.row.key           When set to false (default), Sqoop will not add the column used as a row key into the row data in HBase; when set to true, the column used as a row key will be added to the row data in HBase
Introduction to NoSQL

• NoSQL is a type of database management system (DBMS) that is
designed to handle and store large volumes of unstructured and
semi-structured data.
• Unlike traditional relational databases that use tables with pre-
defined schemas to store data, NoSQL databases use flexible data
models that can adapt to changes in data structures and are
capable of scaling horizontally to handle growing amounts of
data.
• The term NoSQL originally referred to “non-SQL” or “non-
relational” databases, but the term has since evolved to mean
“not only SQL,” as NoSQL databases have expanded to include a
wide range of different database architectures and data models.
NoSQL databases are generally classified into four main categories:

1. Document databases: These databases store data as semi-structured
documents, such as JSON or XML, and can be queried using
document-oriented query languages.
2. Key-value stores: These databases store data as key-value pairs, and
are optimized for simple and fast read/write operations.
3. Column-family stores: These databases store data as column
families, which are sets of columns that are treated as a single entity.
They are optimized for fast and efficient querying of large amounts of
data.
4. Graph databases: These databases store data as nodes and edges,
and are designed to handle complex relationships between data.
Key Features of NoSQL:

1. Dynamic schema: NoSQL databases do not have a fixed schema and
can accommodate changing data structures without the need for
migrations or schema alterations.
2. Horizontal scalability: NoSQL databases are designed to scale out by
adding more nodes to a database cluster, making them well-suited for
handling large amounts of data and high levels of traffic.
3. Document-based: Some NoSQL databases, such as MongoDB, use a
document-based data model, where data is stored in a semi-structured
format, such as JSON or BSON.
4. Key-value-based: Other NoSQL databases, such as Redis, use a key-
value data model, where data is stored as a collection of key-value
pairs.
5. Column-based: Some NoSQL databases, such as Cassandra, use
a column-based data model, where data is organized into columns
instead of rows.
6. Distributed and high availability: NoSQL databases are often
designed to be highly available and to automatically handle node
failures and data replication across multiple nodes in a database
cluster.
7. Flexibility: NoSQL databases allow developers to store and
retrieve data in a flexible and dynamic manner, with support for
multiple data types and changing data structures.
8. Performance: NoSQL databases are optimized for high
performance and can handle a high volume of reads and writes,
making them suitable for big data and real-time applications.
Advantages of NoSQL: There are many advantages of working with NoSQL
databases such as MongoDB and Cassandra. The main advantages are high
scalability and high availability.
1. High scalability: NoSQL databases use sharding for horizontal scaling. Sharding means partitioning the data and placing it on
multiple machines in such a way that the order of the data is preserved. Vertical scaling means adding more resources to the
existing machine, whereas horizontal scaling means adding more machines to handle the data; vertical scaling is harder to
implement, while horizontal scaling is easy to implement. Examples of horizontally scaling databases are MongoDB, Cassandra, etc.
NoSQL can handle a huge amount of data because of this scalability: as the data grows, NoSQL scales itself to handle that data in
an efficient manner.
2. Flexibility: NoSQL databases are designed to handle unstructured or semi-structured data, which means that
they can accommodate dynamic changes to the data model. This makes NoSQL databases a good fit for
applications that need to handle changing data requirements.
3. High availability : Auto replication feature in NoSQL databases makes it highly available because in case of
any failure data replicates itself to the previous consistent state.
4. Scalability: NoSQL databases are highly scalable, which means that they can handle large amounts of data
and traffic with ease. This makes them a good fit for applications that need to handle large amounts of data or
traffic.
5. Performance: NoSQL databases are designed to handle large amounts of data and traffic, which means that
they can offer improved performance compared to traditional relational databases.
6. Cost-effectiveness: NoSQL databases are often more cost-effective than traditional relational databases, as
they are typically less complex and do not require expensive hardware or software.
7. Agility: Ideal for agile development.
Disadvantages of NoSQL: NoSQL has the following disadvantages.

1. Lack of standardization : There are many different types of NoSQL databases, each with its own unique strengths and
weaknesses. This lack of standardization can make it difficult to choose the right database for a specific application
2. Lack of ACID compliance : NoSQL databases are not fully ACID-compliant, which means that they do not guarantee the
consistency, integrity, and durability of data. This can be a drawback for applications that require strong data consistency
guarantees.
3. Narrow focus : NoSQL databases have a very narrow focus as it is mainly designed for storage but it provides very little
functionality. Relational databases are a better choice in the field of Transaction Management than NoSQL.
4. Open-source: NoSQL databases are open source, and there is no reliable standard for NoSQL yet; in other words, two NoSQL
database systems are likely to be quite different from each other.
5. Lack of support for complex queries : NoSQL databases are not designed to handle complex queries, which means that they are
not a good fit for applications that require complex data analysis or reporting.
6. Lack of maturity : NoSQL databases are relatively new and lack the maturity of traditional relational databases. This can make
them less reliable and less secure than traditional databases.
7. Management challenge : The purpose of big data tools is to make the management of a large amount of data as simple as possible.
But it is not so easy. Data management in NoSQL is much more complex than in a relational database. NoSQL, in particular, has a
reputation for being challenging to install and even more hectic to manage on a daily basis.
8. GUI is not available : GUI mode tools to access the database are not flexibly available in the market.
9. Backup : Backup is a great weak point for some NoSQL databases like MongoDB. MongoDB has no approach for the backup of
data in a consistent manner.
10.Large document size : Some database systems like MongoDB and CouchDB store data in JSON format. This means that
documents are quite large (BigData, network bandwidth, speed), and having descriptive key names actually hurts since they
increase the document size.
Overview of HBase
• HBase is a data model that is similar to Google’s big table. It is an open source, distributed
database developed by Apache software foundation written in Java. HBase is an essential part of
our Hadoop ecosystem. HBase runs on top of HDFS (Hadoop Distributed File System). It can store
massive amounts of data, from terabytes to petabytes. It is column-oriented and horizontally
scalable.
• History of HBase
HBase Architecture
HBase architecture has 3 main components: HMaster,
Region Server, Zookeeper.
• All three components are described below:

1. HMaster –

HMaster is the implementation of the Master Server in HBase. It is the process that assigns regions to
region servers and handles DDL (create, delete table) operations. It monitors all Region Server instances
present in the cluster. In a distributed environment, the Master runs several background threads. HMaster
also handles load balancing, failover, etc.

2. Region Server –

HBase tables are divided horizontally by row-key range into regions. Regions are the basic building blocks
of an HBase cluster; they hold the distributed portions of tables and are composed of column families. A
Region Server runs on an HDFS DataNode in the Hadoop cluster and is responsible for handling, managing,
and executing read and write operations on its set of regions. The default size of a region is 256 MB.

3. Zookeeper –

ZooKeeper acts as a coordinator in HBase. It provides services such as maintaining configuration
information, naming, distributed synchronization, and server failure notification. Clients locate region
servers via ZooKeeper.
• Advantages of HBase –

1. Can store large data sets

2. Database can be shared

3. Cost-effective from gigabytes to petabytes

4. High availability through failover and replication

• Disadvantages of HBase –

1. No support for SQL structure

2. No transaction support

3. Sorted only on key

4. Memory issues on the cluster


• Features of HBase architecture :
• Distributed and Scalable: HBase is designed to be distributed and scalable, which means it can
handle large datasets and can scale out horizontally by adding more nodes to the cluster.
• Column-oriented Storage: HBase stores data in a column-oriented manner, which means data is
organized by columns rather than rows. This allows for efficient data retrieval and aggregation.
• Hadoop Integration: HBase is built on top of Hadoop, which means it can leverage Hadoop’s
distributed file system (HDFS) for storage and MapReduce for data processing.
• Consistency and Replication: HBase provides strong consistency guarantees for read and write
operations, and supports replication of data across multiple nodes for fault tolerance.
• Built-in Caching: HBase has a built-in caching mechanism that can cache frequently accessed
data in memory, which can improve query performance.
• Compression: HBase supports compression of data, which can reduce storage requirements and
improve query performance.
• Flexible Schema: HBase supports flexible schemas, which means the schema can be updated on
the fly without requiring a database schema migration.
• Note – HBase is extensively used for online analytical operations; for example, in banking applications
it can be used for real-time data updates, such as those behind ATM machines.
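Although the CRUD section below uses SQL syntax, the same operations can
also be performed on HBase through its interactive shell. The following is a
hedged sketch using hypothetical table, column-family, and row names; create,
put, get, scan, delete, disable, and drop are the standard HBase shell commands:

$ hbase shell
create 'employees', 'info'                      # create a table with one column family
put 'employees', 'row1', 'info:name', 'Alice'   # insert or update a cell (Create/Update)
get 'employees', 'row1'                         # read a single row (Read)
scan 'employees'                                # read all rows (Read)
delete 'employees', 'row1', 'info:name'         # delete a cell (Delete)
disable 'employees'                             # a table must be disabled before dropping it
drop 'employees'                                # remove the table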
CRUD Operations
• CRUD operations act as the foundation of any computer
programming language or technology. So before taking a
deeper dive into any programming language or technology,
one must be proficient in working on its CRUD operations.
This same rule applies to databases as well.
1. Create:

• In CRUD operations, 'C' stands for create, which means adding or
inserting data into an SQL table. First we create a table using the CREATE
command, and then we use the INSERT INTO command to insert rows into
the created table.
• Syntax for table creation:
CREATE TABLE Table_Name (ColumnName1 Datatype,
ColumnName2 Datatype, ..., ColumnNameN Datatype)
Where
• Table_Name is the name that we want to assign to the table.
• ColumnName is an attribute under which we want to store the data of the table.
• Datatype is assigned to each column and decides the type of data that
will be stored in the respective column.
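A minimal worked example follows; the Employees table and its columns are
hypothetical and only illustrate the syntax above:

CREATE TABLE Employees (EmpID INT, EmpName VARCHAR(50), Salary DECIMAL(10,2));
INSERT INTO Employees VALUES (1, 'Asha', 55000.00);
INSERT INTO Employees VALUES (2, 'Ravi', 48000.00);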
2. Read:
• In CRUD operations, 'R' stands for read, which means retrieving
or fetching data from the SQL table. We use the SELECT
command to fetch the inserted records from the SQL table. We can retrieve
all the records from a table using an asterisk (*) in a SELECT query. There
is also the option of retrieving only those records which satisfy a particular
condition, by using the WHERE clause in a SELECT query.
Syntax to fetch all the records:

1. SELECT * FROM TableName;

Syntax to fetch records according to a condition:

1. SELECT * FROM TableName WHERE CONDITION;

3. Update:

• In CRUD operations, 'U' stands for update, which means making
changes to the records present in the SQL tables. We use the
UPDATE command to modify the data present in tables.
Syntax:

1. UPDATE Table_Name SET ColumnName = Value WHERE CONDITION;
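For instance, on the hypothetical Employees table:

UPDATE Employees SET Salary = 60000.00 WHERE EmpID = 1;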


4. Delete:

• In CRUD operations, 'D' stands for delete, which means removing
or deleting records from the SQL tables. We can delete all the rows
from an SQL table using the DELETE query. There is also the option
to remove only the specific records that satisfy a particular condition,
by using the WHERE clause in a DELETE query.
Syntax to delete all the records:

1. DELETE FROM TableName;

Syntax to delete records according to a condition:

2. DELETE FROM TableName WHERE CONDITION;
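Finishing the hypothetical Employees example:

DELETE FROM Employees WHERE EmpID = 2;   -- delete one record
DELETE FROM Employees;                   -- delete all records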


THE END
