Final BDA 1-8 Lab Aayush
Aim: To study basic commands available for the Hadoop Distributed File System (HDFS).
HDFS Commands
HDFS is the primary storage component of the Hadoop ecosystem. It is responsible for
storing large data sets of structured or unstructured data across various nodes, and it
maintains the metadata in the form of log files. To use the HDFS commands, first you need
to start the Hadoop services using the following commands:
start-all.sh   # starts all Hadoop daemons (HDFS and YARN)
stop-all.sh    # stops all Hadoop daemons
hadoop version
The hadoop version command prints the installed Hadoop version.
To check that the Hadoop services are up and running, use the following command:
jps
hadoop fs -ls
It will print all the files and directories present in the current HDFS directory.
mkdir:
To create a directory. In HDFS there is no home directory by default, so let's first create
one:
hadoop fs -mkdir bdalab
Creating a local file and viewing its content:
vi lab.txt
cat lab.txt
put
To copy files/folders from the local file system to the HDFS store. This is the most important
command. Local file system means the files present on the OS.
Syntax:
hadoop fs -put <localsrc> <dest>
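A concrete invocation using the lab.txt file and bdalab directory created above:
hadoop fs -put lab.txt bdalab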
http://localhost:50070/
Open this URL in a browser to check, through the NameNode web UI, whether the file was copied to the Hadoop file system.
copyToLocal (or) get: To copy files/folders from the HDFS store to the local file system.
Syntax:
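The example screenshot from the original is not reproduced; a typical invocation, copying the file uploaded earlier back to a local path (the destination path is an assumption), would be:
hadoop fs -get bdalab/lab.txt /home/user/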
cp: This command is used to copy files within HDFS. Let's copy the folder geeks to geeks_copied.
Syntax:
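The command itself is missing from the original; based on the description above, it would be:
hadoop fs -cp /geeks /geeks_copied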
mv: This command is used to move files within HDFS.
Syntax:
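The mv example is not reproduced in the original; a typical invocation (source and destination paths are assumptions) would be:
hadoop fs -mv /geeks/lab.txt /geeks_copied/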
rmr: This command deletes a directory recursively.
Syntax:
hadoop fs -rmr /directory
It will delete all the content inside the directory and then the directory itself.
stat: It will give the last modified time of a directory or path; in short, it gives the stats of the
directory or file.
Syntax:
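The stat example is not reproduced in the original; a typical invocation against the bdalab directory created earlier would be:
hadoop fs -stat bdalab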
setrep: This command is used to change the replication factor of a file/directory in HDFS. By
default, it is 3 for anything stored in HDFS (as set by dfs.replication in hdfs-site.xml).
Example: To change the replication factor to 6 for the directory test stored in HDFS:
hadoop fs -setrep -R -w 6 test
Note: -R means recursively; we use it for directories, as they may contain many files and
folders inside them.
test
The test command is used for file test operations.
Option  Description
-d      Check whether the given path is a directory; return 0 if it is a directory.
-e      Check whether the given path exists; return 0 if it exists.
-f      Check whether the given path is a file; return 0 if it is a file.
-s      Check whether the path is not empty; return 0 if it is not empty.
-r      Return 0 if the path exists and read permission is granted.
-w      Return 0 if the path exists and write permission is granted.
-z      Check whether the file size is 0 bytes; return 0 if it is 0 bytes.
Example
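The test example screenshot is not reproduced; a typical check against the bdalab directory created earlier would be:
hadoop fs -test -d bdalab
echo $?   # prints 0 if bdalab exists and is a directory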
getmerge
The getmerge command merges a list of files in a directory on HDFS into a single file on the
local file system.
Example
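The getmerge example is not reproduced; a typical invocation, merging every file in the bdalab directory into one local file (the output name is an assumption), would be:
hadoop fs -getmerge bdalab merged.txt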
stat prints the statistics about the file or directory in the specified format.
Formats:
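The format list from the original is not reproduced; commonly used format specifiers include %b (file size in bytes), %n (file name), %o (block size), %r (replication factor), %u (owner user name), %g (owner group name), %y (modification date), and %F (type: file or directory).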
Example
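For instance, to print the name, size in bytes, and replication factor of the lab.txt file uploaded earlier:
hadoop fs -stat "%n %b %r" bdalab/lab.txt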
PRACTICAL 6
Aim: To study basic commands available for HIVE Query Language.
Description:
Apache Hive is an open-source data warehousing tool for performing distributed processing
and data analysis. It was developed by Facebook to reduce the work of writing Java
MapReduce programs. Apache Hive uses the Hive Query Language, a declarative
language similar to SQL, and translates Hive queries into MapReduce programs. It
enables developers to perform processing and analysis on structured and semi-structured data
by replacing complex Java MapReduce programs with Hive queries. Anyone who is familiar with
SQL commands can easily write Hive queries.
Hive supports applications written in many languages, such as Python, Java, C++, and Ruby, using
JDBC, ODBC, and Thrift drivers for performing queries on Hive. Hence, one can easily
write a Hive client application in the language of their choice.
Hive clients are categorized into three types:
1. Thrift Clients
The Hive server is based on Apache Thrift so that it can serve the request from a thrift client.
2. JDBC client
Hive allows Java applications to connect to it using the JDBC driver. The JDBC driver uses
Thrift to communicate with the Hive server.
3. ODBC client
The Hive ODBC driver allows applications based on the ODBC protocol to connect to Hive.
Similar to the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.
Hive - Create Database
In Hive, a database is considered a catalog or namespace of tables. So, we can maintain
multiple tables within a database, where a unique name is assigned to each table. Hive also
provides a default database named default.
First, we check the default database provided by Hive. To see the list of existing
databases, use the following command:
hive> show databases;
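The demo database used in the create table statement below can be created with:
hive> create database demo;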
Internal table
Internal tables are also called managed tables, as the lifecycle of their data is controlled by
Hive. By default, these tables are stored in a subdirectory under the directory defined by
hive.metastore.warehouse.dir (i.e. /user/hive/warehouse). Internal tables are not flexible
enough to share with other tools like Pig. If we drop an internal table, Hive deletes both the
table schema and the data.
hive> create table demo.employee (Id int, Name string , Salary float)
row format delimited
fields terminated by ',' ;
External Table
An external table allows us to create and access a table whose data is stored externally. The
external keyword is used to specify an external table, whereas the location keyword
determines the location of the loaded data. As the table is external, the data is not present in the
Hive warehouse directory. Therefore, if we drop the table, only the metadata of the table is deleted;
the data still exists.
hive> create external table emplist (Id int, Name string , Salary float)
row format delimited
fields terminated by ','
location '/HiveDirectory';
Hive - Load Data
Once the internal table has been created, the next step is to load the data into it. In Hive, we
can easily load data from a file into a table.
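The load command itself is not reproduced in the original; a minimal sketch, assuming a local CSV file (path hypothetical) matching the employee schema, would be:
hive> load data local inpath '/home/user/employee.csv' into table demo.employee;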
PRACTICAL 7
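Aim: To study basic commands available for HBase.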
Description:
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an
open-source project and is horizontally scalable. HBase has a data model similar to
Google's Bigtable, designed to provide quick random access to huge amounts of structured
data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS). It is a part of
the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop
file system. One can store data in HDFS either directly or through HBase, and data consumers
read/access the data in HDFS randomly using HBase. HBase sits on top of the Hadoop file
system and provides read and write access.
Data Definition Language:
1. create
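The create command example is missing from the original; a typical invocation for the emp table used below (column family names are assumptions) is:
create 'emp', 'personal', 'professional'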
2. list
list
3. disable
disable 'emp'
4. is_disabled
is_disabled 'emp'
5. enable
enable 'emp'
6. is_enabled
is_enabled 'emp'
7. describe
describe 'emp'
8. drop
drop 'emp'
Data Manipulation Language:
9. put
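The put example is missing from the original; a typical invocation (column family, qualifier, and value are assumptions) is:
put 'emp', '1', 'personal:name', 'Aayush'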
10. get
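The get example is missing; a typical invocation reading back row 1 is:
get 'emp', '1'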
11. delete
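The delete example is missing; a typical invocation (the cell coordinates are assumptions) is:
delete 'emp', '1', 'personal:name'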
12. deleteall
deleteall 'emp','1'
13. scan
scan 'emp'
14. count
count 'emp'
15. truncate
truncate 'emp'
PRACTICAL 8
Aim: To create HDFS tables, load them into Hive, and learn joins and partitioning of
tables in Hive.
Description:
Partitions
Each table can be broken into partitions; partitions determine the distribution of data within
subdirectories. Today, huge amounts of data, in the range of petabytes, are stored in HDFS,
so it becomes very difficult for Hadoop users to query all of this data.
Hive was introduced to lower this burden of data querying. Apache Hive converts
SQL queries into MapReduce jobs and then submits them to the Hadoop cluster. When we
submit a SQL query, Hive reads the entire data set, so it becomes inefficient to run MapReduce
jobs over a large table. This is resolved by creating partitions in tables. Apache Hive
makes implementing partitions very easy: partition columns are declared at table-creation time
and Hive lays out the partitions automatically.
In the partitioning method, all the table data is divided into multiple partitions. Each partition
corresponds to a specific value (or values) of the partition column(s) and is kept as a
subdirectory inside the table's directory in HDFS. Therefore, when querying a particular table,
only the appropriate partition of the table, the one containing the query value, is read. This
decreases the I/O time required by the query and hence increases performance.
Static partitions
Inserting input data files individually into a partitioned table is called static partitioning. Static
partitions are usually preferred when loading big files into Hive tables, and they save time in
loading data compared to dynamic partitions. You "statically" add a partition to the table and
move the file into that partition. We can alter partitions in static partitioning. You
can get the partition column value from the file name, day of date, etc. without reading the whole
big file. If you want to use static partitions in Hive, you should set the property
hive.mapred.mode = strict (this property is set by default in hive-site.xml); static partitioning
works in strict mode. You should use a where clause to limit the partitions read. You can perform
static partitioning on Hive managed tables or external tables.
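A minimal sketch of a static-partition load, reusing the stud_part table and Karnataka.txt file shown later in this practical (the local path and the city value are assumptions):
hive> load data local inpath '/home/user/Karnataka.txt'
into table stud_part partition (state='Karnataka', city='Bangalore');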
Dynamic partitions
A single insert into a partitioned table is known as a dynamic partition. Usually, a dynamic
partition loads the data from a non-partitioned table. Dynamic partitioning takes more time in
loading data compared to static partitioning. Dynamic partitioning is suitable when you have
large data stored in a table, and also when you want to partition on a number of columns but
don't know in advance how many. With dynamic partitioning there is no need for a
where clause to limit partitions. We can't alter a dynamic partition. You can perform
dynamic partitioning on Hive external tables and managed tables. To use dynamic
partitioning in Hive, the mode must be set to non-strict. The Hive dynamic partition
properties you should enable are shown below.
use test;
drop database test;
show tables;
drop table student;
show databases;
Dynamic partitioning
Note: By default, dynamic partitioning is disabled. We need to enable it using the following
commands:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
create table stu(name string, rollno int, percentage float, state string, city string)
row format delimited fields terminated by ',';
create table stud_part (name string, rollno int, percentage float)
partitioned by (state string, city string)
row format delimited
fields terminated by ',';
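The statements that actually populate the partitions are not reproduced in the original; a minimal sketch, assuming the staging table stu has been filled with rows that include state and city values (the sample files below contain only name, rollno, and percentage, so those columns would have to be supplied), lets Hive create the partitions dynamically:
hive> insert into table stud_part partition (state, city)
select name, rollno, percentage, state, city from stu;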
Karnataka.txt
Rajesh,100,78
Abhishek,95,76
Manish,102,89
siva,203,66
sania,204,77
Maharastra.txt
ravi,100,56
mohan,95,89
mahesh,102,67
janvi,103,66
Hive Join
Let's see two tables, Employee and EmployeeDepartment, that are going to be joined.
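The join query itself is not reproduced in the original; a minimal sketch, with assumed column names (empid as the join key and a department column in EmployeeDepartment), would be:
hive> select e.name, d.department
from employee e join employeedepartment d
on (e.empid = d.empid);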