UNITS-5 (1)
UNITS-5 (1)
1. Apache Pig
Introduction to Pig:
Apache Pig is a high-level platform for creating MapReduce programs used with Hadoop. Developed
by Yahoo!, Pig simplifies the processing of large datasets using its scripting language called Pig Latin.
It is primarily used for ETL (Extract, Transform, Load) operations, offering a flexible, data-flow
approach to handling complex data processing tasks.
• Local Mode: Executes Pig scripts on a single JVM, useful for development and testing.
• MapReduce Mode: Runs Pig scripts on a Hadoop cluster, enabling distributed processing.
This is suitable for large-scale data tasks.
Grunt Shell:
Grunt is the interactive shell for Pig. It allows users to execute Pig Latin commands interactively. This
is especially useful for testing small data samples, exploring data structures, or debugging complex
data pipelines.
Pig Latin:
Pig Latin is the scripting language used in Pig. It is a data flow language that provides a series of
transformations on data. Key commands include:
2. Apache Hive
Hive was developed by Facebook to bring SQL-like querying capability to Hadoop. It uses HiveQL, a
declarative query language similar to SQL, making it accessible to users familiar with traditional
databases. Hive is ideal for data warehousing tasks, transforming and querying structured datasets
stored in HDFS.
Hive converts HiveQL statements into MapReduce or Tez/Spark jobs under the hood. It supports
operations such as SELECT, JOIN, GROUP BY, and aggregates. Hive is best suited for batch processing
rather than real-time querying, making it useful for business intelligence and reporting tasks.
3. Apache HBase
HBase is a distributed, column-oriented NoSQL database built on top of HDFS. Inspired by Google’s
Bigtable, HBase provides real-time, random read/write access to large datasets. Unlike Hive and Pig,
which are batch-oriented, HBase is optimized for low-latency operations.
HBase stores data in tables with rows and column families. Each cell can contain multiple versions,
indexed by timestamps. It supports horizontal scaling and is ideal for applications like messaging
platforms, sensor data capture, and financial transactions.
Conclusion
Pig, Hive, and HBase together empower Hadoop to handle a wide spectrum of big data needs—from
batch ETL to SQL-style analytics and real-time data access. Pig simplifies complex data flows, Hive
offers structured querying, and HBase delivers high-speed random access—each tool playing a