B22 BDA Experiment 03
LAB MANUAL
PART A
(PART A: TO BE REFERRED BY STUDENTS)
Experiment No-03
A.1 Aim:
To install Sqoop and execute basic commands of Hadoop ecosystem component Sqoop
A-2 Prerequisite
Knowledge of Java, Python and the VMware software pack.
A.3 Outcome
Students will be able to acquire fundamental enabling techniques and scalable
algorithms such as Hadoop, MapReduce and NoSQL in big data analytics.
A.4 Theory:
Introduction
Generally, applications interact with the relational database using RDBMS, and thus this
makes relational databases one of the most important sources that generate Big Data. Such
data is stored in RDB Servers in the relational structure. Here, Apache Sqoop plays an
important role in Hadoop ecosystem, providing feasible interaction between the relational
database server and HDFS.
So, Apache Sqoop is a tool in the Hadoop ecosystem designed to transfer data
between HDFS (Hadoop storage) and relational database servers such as MySQL, Oracle,
SQLite, Teradata, Netezza and Postgres. Apache Sqoop imports data from relational
databases to HDFS, and exports data from HDFS to relational databases. It efficiently
transfers bulk data between Hadoop and external data stores such as enterprise data
warehouses and relational databases.
This is how Sqoop got its name – “SQL to Hadoop & Hadoop to SQL”.
Additionally, Sqoop is used to import data from external datastores into Hadoop ecosystem’s
tools like Hive & HBase.
1. Full Load: Apache Sqoop can load a whole table with a single command. You can
also load all the tables from a database using a single command.
2. Incremental Load: Apache Sqoop also provides the facility of incremental load,
where you can load parts of a table whenever it is updated.
3. Parallel import/export: Sqoop uses YARN framework to import and export the data,
which provides fault tolerance on top of parallelism.
VIKAS RAMPRAKASH CHAURASIYA TU3F2021091
ROLL NO:B22 BDA_EXP_03
4. Import results of SQL query: You can also import the result returned from an SQL
query in HDFS.
5. Compression: You can compress your data using the deflate (gzip) algorithm with the
--compress argument, or by specifying the --compression-codec argument. You can also
load a compressed table into Apache Hive.
6. Connectors for all major RDBMS databases: Apache Sqoop provides connectors for
multiple RDBMS databases, covering nearly all of the commonly used ones.
7. Kerberos Security Integration: Kerberos is a computer network authentication
protocol which works on the basis of ‘tickets’ to allow nodes communicating over a
non-secure network to prove their identity to one another in a secure manner. Sqoop
supports Kerberos authentication.
8. Load data directly into HIVE/HBase: You can load data directly into Apache
Hive for analysis and also dump your data in HBase, which is a NoSQL database.
9. Support for Accumulo: You can also instruct Sqoop to import the table in Accumulo
rather than a directory in HDFS.
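As a sketch, an incremental, compressed import combining features 2 and 5 above might look like the following (the connection URL, credentials, table and column names are assumptions for a typical local setup):

```shell
# Incremental append import: only rows whose id exceeds the last imported
# value are fetched; output is gzip-compressed. All connection details
# below are placeholders.
sqoop import \
  --connect jdbc:mysql://localhost:3306/employees \
  --username root -P \
  --table employees \
  --incremental append \
  --check-column id \
  --last-value 1000 \
  --compress
```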
It is Sqoop's architecture that makes these benefits possible. Now that we
know the features of Apache Sqoop, let's move ahead and understand Apache Sqoop's
architecture and working.
The import tool imports individual tables from an RDBMS to HDFS. Each row in a table is
treated as a record in HDFS.
When we submit a Sqoop command, the main task gets divided into subtasks, each handled
internally by an individual map task. A map task is a subtask that imports part of the data into
the Hadoop ecosystem. Collectively, all map tasks import the whole data.
Export also works in a similar manner.
The export tool exports a set of files from HDFS back to an RDBMS. The files given as input
to Sqoop contain records, which are called rows in the table.
When we submit our job, it is mapped into map tasks, each of which brings a chunk of data from
HDFS. These chunks are exported to a structured data destination. Combining all these
exported chunks of data, we receive the whole data at the destination, which in most
cases is an RDBMS (MySQL/Oracle/SQL Server).
A reduce phase is required only in the case of aggregations. But Apache Sqoop just imports and
exports the data; it does not perform any aggregations. The map job launches multiple mappers
depending on the number defined by the user. For a Sqoop import, each mapper task is
assigned a part of the data to be imported. Sqoop distributes the input data equally among the
mappers to get high performance. Each mapper then creates a connection with the
database using JDBC, fetches the part of the data assigned by Sqoop, and writes it into HDFS,
Hive or HBase based on the arguments provided on the CLI.
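A minimal export invocation matching the description above might look like this (the connection details, table name and HDFS path are assumptions; the target table must already exist in the database):

```shell
# Export files from an HDFS directory back into a MySQL table using
# four parallel mappers. All names below are placeholders.
sqoop export \
  --connect jdbc:mysql://localhost:3306/employees \
  --username root -P \
  --table employees_copy \
  --export-dir /user/hadoop/employees \
  -m 4
```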
Now that we understand the architecture and working of Apache Sqoop, let's look at
some basic Sqoop commands.
Sqoop Commands
Sqoop – IMPORT Command
The import command is used to import a table from a relational database into HDFS. In our case,
we are going to import tables from a MySQL database into HDFS.
We have an employees table in the employees database, which we will be importing
into HDFS.
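A minimal import command for this scenario might look like the following (the hostname, port and credentials are assumptions; adjust them for your own setup):

```shell
# Import the employees table from a local MySQL server into HDFS.
# -P prompts for the password interactively rather than putting it on
# the command line.
sqoop import \
  --connect jdbc:mysql://localhost:3306/employees \
  --username root -P \
  --table employees
```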
VIKAS RAMPRAKASH CHAURASIYA TU3F2021091
ROLL NO:B22 BDA_EXP_03
After executing this command, map tasks are executed at the back end.
After the command completes, you can check the HDFS Web UI (localhost:50070), where
the imported data can be seen.
You can also import the table in a specific directory in HDFS using the below command:
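A sketch of such a command, using --target-dir to choose the HDFS directory (the directory name and connection details are assumptions):

```shell
# Import into a specific HDFS directory instead of the default
# /user/<username>/<table> path.
sqoop import \
  --connect jdbc:mysql://localhost:3306/employees \
  --username root -P \
  --table employees \
  --target-dir /queryresult
```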
Sqoop imports data in parallel from most database sources. You can specify the number of
map tasks (parallel processes) to use for the import with the -m or --num-mappers
argument. Each of these arguments takes an integer value which corresponds to the
degree of parallelism to employ.
You can control the number of mappers independently from the number of files present in the
directory. Export performance depends on the degree of parallelism. By default, Sqoop will
use four tasks in parallel for the export process. This may not be optimal; you will need to
experiment with your own particular setup. Additional tasks may offer better concurrency,
but if the database is already bottlenecked on updating indices, invoking triggers, and so on,
then additional load may decrease performance.
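For example, the degree of parallelism could be reduced to a single mapper like this (connection details are assumptions; -m 1 and --num-mappers 1 are equivalent):

```shell
# Run the import with exactly one mapper, producing a single output file.
sqoop import \
  --connect jdbc:mysql://localhost:3306/employees \
  --username root -P \
  --table employees \
  -m 1
```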
Here the number of mapper tasks is 1. The number of files created while importing
MySQL tables is equal to the number of mappers used.
You can import a subset of a table using the 'where' clause of the Sqoop import tool. Sqoop
executes the corresponding SQL query on the respective database server and stores the result in a
target directory in HDFS. You can use a command of the following form to import data with a
'where' clause:
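A sketch of a filtered import (the predicate, column name and connection details are assumptions):

```shell
# Import only the rows matching the --where predicate into the given
# HDFS directory. The condition is pushed down to the database server.
sqoop import \
  --connect jdbc:mysql://localhost:3306/employees \
  --username root -P \
  --table employees \
  --where "emp_no > 49000" \
  --target-dir /latest_employees
```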
PART B
(PART B: TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per the following segments within two hours of the
practical. The soft copy must be uploaded on Blackboard, or emailed to the concerned
lab in-charge faculty at the end of the practical in case there is no Blackboard access
available.)
Sqoop currently supports Hadoop version 2.6.0 or later. To install the Sqoop server,
decompress the tarball (in a location of your choosing) and set the newly created folder
as your working directory.
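The installation steps above can be sketched as follows (the tarball name, version and install location are assumptions; use the tarball you actually downloaded):

```shell
# Unpack the Sqoop tarball into /opt and make it the working directory.
tar -xzf sqoop-1.99.7-bin-hadoop200.tar.gz -C /opt
cd /opt/sqoop-1.99.7-bin-hadoop200

# Optionally export SQOOP_HOME so the bin/ scripts are found from any shell.
export SQOOP_HOME=/opt/sqoop-1.99.7-bin-hadoop200
export PATH=$PATH:$SQOOP_HOME/bin
```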
B.4 Conclusion:
We have installed Sqoop and executed basic commands of the Hadoop ecosystem
component Sqoop, and are able to acquire fundamental enabling techniques and scalable
algorithms such as Hadoop, MapReduce and NoSQL in big data analytics.
Q1: What is the default file format to import data using Apache Sqoop?
Delimited Text File Format
Q2. How will you list all the columns of a table using Apache Sqoop?
Sqoop has no dedicated command to list the columns of a table. A common approach is to
use the sqoop eval tool to run a metadata query (such as a query against
information_schema.columns) directly on the database server.
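A sketch of such a metadata query via sqoop eval (connection details and table name are assumptions):

```shell
# Run a column-listing query on the MySQL server through Sqoop's eval tool
# and print the result to the console. All names below are placeholders.
sqoop eval \
  --connect jdbc:mysql://localhost:3306/employees \
  --username root -P \
  --query "SELECT column_name FROM information_schema.columns \
           WHERE table_name = 'employees'"
```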
Q3. Name a few import control commands. How can Sqoop handle large
objects?
import and import-all-tables are the import control commands.
In Sqoop, large objects are managed by importing them into a file known
as "LobFile" which is short for a Large Object File. These LobFiles have
the capability to store large sized data records.
Q4. How can we import data from particular row or column? What is the