Assignment 05
Q1) What is Apache Flume? State and explain the features of Apache Flume.
Ans:-
Apache Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data from various sources to a
centralized data store, like HDFS (Hadoop Distributed File System). Flume is mainly
used for log data collection in Hadoop environments but can also handle streaming
data.
Ans:-
Flume Event-
An event is the basic unit of data transported inside Flume. It carries a byte-array
payload from the source to the destination, optionally accompanied by a set of string
headers (key-value metadata). A minimal example of constructing and sending an
event is sketched below.
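As a concrete illustration, here is a minimal, hedged sketch using the Flume client SDK: it builds an event with a byte-array body and one optional header and sends it to an agent's Avro source. The hostname, port, and header name are assumptions made for this example only.

import java.nio.charset.StandardCharsets;
import java.util.Collections;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeEventExample {
    public static void main(String[] args) throws EventDeliveryException {
        // Connect to a Flume agent whose Avro source listens on localhost:41414
        // (hostname and port are assumptions for this sketch).
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            // A Flume event = byte-array payload + optional string headers.
            Event event = EventBuilder.withBody(
                    "user logged in".getBytes(StandardCharsets.UTF_8),
                    Collections.singletonMap("host", "web-01"));   // optional header
            client.append(event);   // hand the event to the agent's source
        } finally {
            client.close();
        }
    }
}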
Flume Agent-
An agent is an independent daemon process (JVM) in Flume. It receives the data
(events) from clients or other agents and forwards it to its next destination (sink or
agent). A Flume deployment may have more than one agent. The following diagram represents a Flume
Agent.
As shown in the diagram a Flume Agent contains three main components namely,
source, channel, and sink.
Source
● A source is the component of an Agent which receives data from the data
generators and transfers it to one or more channels in the form of Flume events.
● Apache Flume supports several types of sources and each source receives
events from a specified data generator.
● Example − Avro source, Thrift source, Twitter 1% firehose source, etc.
Channel
● A channel is a transient store which receives the events from the source and
buffers them till they are consumed by sinks.
● It acts as a bridge between the sources and the sinks.
● These channels are fully transactional and they can work with any number of
sources and sinks.
● Example − JDBC channel, File system channel, Memory channel, etc.
Sink
● A sink stores the data into centralized stores like HBase and HDFS.
● It consumes the data (events) from the channels and delivers it to the
destination.
● The destination of the sink might be another agent or the central stores.
● Example − HDFS sink
Ans:
As shown in the figure, there are various components in the Apache Pig framework. Let
us take a look at the major components.
Parser
● Initially the Pig Scripts are handled by the Parser.
● It checks the syntax of the script, does type checking, and other miscellaneous
checks.
● The output of the parser will be a DAG (directed acyclic graph), which represents
the Pig Latin statements and logical operators.
● In the DAG, the logical operators of the script are represented as the nodes and
the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical
optimizations such as projection pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
● Finally, the MapReduce jobs are submitted to Hadoop in a sorted order.
● These MapReduce jobs are then executed on Hadoop, producing the desired
results. (A short embedded-Pig example is sketched below.)
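To make the path from Pig Latin to MapReduce jobs concrete, here is a small, hedged word-count sketch run through the embedded PigServer API. The file names and aliases are invented for the example; this is not the only way to run Pig scripts (the grunt shell and pig script files are the usual routes).

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
    public static void main(String[] args) throws Exception {
        // Local mode for the sketch; ExecType.MAPREDUCE would submit jobs to a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Each statement below is parsed into the logical plan (DAG),
        // optimized, compiled into MapReduce jobs, and executed when
        // a STORE (or DUMP) forces evaluation.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;");

        pig.store("counts", "wordcount_out");   // triggers compilation and execution
        pig.shutdown();
    }
}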
Q4) Explain in detail the working of Apache Flume; also describe flows in Apache
Flume.
Or
Describe in detail fan-in and fan-out flows.
Ans:
Working of Apache Flume:
● Data generators (such as web servers or application servers) produce events, which
are picked up by the sources running inside a Flume agent.
● The source writes the events into one or more channels inside a transaction; the
channel buffers them until a sink consumes them.
● The sink removes the events from the channel, again inside a transaction, and
delivers them either to a centralized store (e.g., HDFS, HBase) or to the source of the
next agent in a multi-hop flow.
Flows in Apache Flume:
1. Fan-in Flow:
● In a fan-in flow, multiple sources send data to a single sink through one or more
channels.
● This configuration is useful when you have many sources collecting data from
different locations, and you want to aggregate this data to a central repository.
● Use Case: Aggregating logs from multiple servers and sending them to a single
HDFS directory.
2. Fan-out Flow:
● In a fan-out flow, one source sends data to multiple sinks via different channels.
● This is useful when the same data needs to be sent to multiple destinations, for
example, writing one copy of the data to HDFS for long-term storage and another
to HBase for real-time processing.
● Use Case: Storing logs in both HDFS for batch processing and HBase for real-
time querying.
Ans:
Channel
● A channel is a transient store which receives the events from the source and
buffers them till they are consumed by sinks.
● It acts as a bridge between the sources and the sinks.
● These channels are fully transactional and they can work with any number of
sources and sinks.
● Example − JDBC channel, File system channel, Memory channel, etc.
Sink
● A sink stores the data into centralized stores like HBase and HDFS.
● It consumes the data (events) from the channels and delivers it to the
destination.
● The destination of the sink might be another agent or the central stores.
● Example − HDFS sink
Channel Selector in Detail:
● A channel selector determines which of a source's channels an incoming event
should be written to.
● Replicating channel selector (default): copies every event to all the channels
configured for the source.
● Multiplexing channel selector: routes each event to a specific channel based on the
value of a header in that event.
Ans:
● Flume ensures reliable data transmission between the source and sink by using
channels (like File Channel) that offer durability.
● In case of any failure (such as network issues, sink failure, or node crashes),
Flume can buffer data in channels and retry delivery, ensuring fault tolerance.
● This makes sure that no data is lost during transmission or during failures.
● Apache Flume supports a wide range of data sources such as log files, Avro,
syslogs, HTTP streams, or custom-built data sources.
● It also supports a wide variety of sinks such as HDFS (Hadoop Distributed File
System), HBase, Elasticsearch, or even custom sinks.
● This flexibility makes it a versatile tool for different use cases and environments.
5. Event-driven Model:
● Flume follows an event-driven model: data is broken into discrete events (a
byte-array payload plus optional headers) that flow through the agent as they arrive,
enabling continuous streaming ingestion rather than periodic batch loads.
6. Data Aggregation:
● Apache Flume can handle data aggregation from various sources into a single
destination (e.g., aggregating logs from multiple servers into a central HDFS
directory).
● It can also handle fan-out scenarios where data from a single source is sent to
multiple sinks.
Q7) Explain the role of Apache Sqoop. Also state the key features of Apache Sqoop.
Ans:
● Apache Sqoop is a component of the Hadoop ecosystem.
● There was a need for a specialized tool to perform this process quickly because
a lot of data needed to be moved from relational database systems onto Hadoop.
● This is when Apache Sqoop entered the scene; it is now widely used for
moving data from RDBMS tables to the Hadoop ecosystem for MapReduce
processing and other uses.
● Data must first be fed into Hadoop clusters from various sources to be processed
using Hadoop. However, it turned out that loading data from several
heterogeneous sources was a challenging task.
● The issues that administrators ran into included:
● Keeping data consistent
● Ensuring effective resource management
Key features of Apache Sqoop:
● Sqoop uses the YARN framework to import and export data, which provides fault
tolerance on top of parallelism.
● We can import the results of a SQL query into HDFS using Sqoop.
● For several RDBMSs, including MySQL and Microsoft SQL Server, Sqoop offers
connectors.
● Sqoop supports the Kerberos computer network authentication protocol, allowing
nodes to authenticate users while securely communicating across an unsafe
network.
● Sqoop can load the full table or specific sections with a single command.
Ans:
1. Sqoop Client:
○ The Sqoop Client is the command-line interface (CLI) that allows users to
interact with Sqoop.
○ Users submit import/export commands via the Sqoop client, specifying
details such as database connection, tables, HDFS, Hive, or HBase
locations, and other configurations.
○ The client triggers the appropriate Sqoop job based on the user’s input.
2. Connector:
○ Connectors are the components that allow Sqoop to communicate with
different relational databases (such as MySQL, Oracle, PostgreSQL, SQL
Server, etc.).
○ Sqoop uses JDBC or ODBC drivers to connect to databases and execute
SQL queries to read or write data.
○ Some databases may have specialized connectors to optimize
performance for that specific database.
3. Mapper (MapReduce Framework):
○ The core of Sqoop's data transfer mechanism is based on Hadoop’s
MapReduce framework.
○ During an import or export process, Sqoop breaks down the task into
multiple parallel mappers, which allow for faster processing of large
datasets by reading or writing data in chunks.
○ Each mapper reads or writes a portion of the data, allowing Sqoop to
handle very large datasets efficiently.
○ Sqoop does not use the reduce phase in MapReduce since it is mainly
focused on transferring data, not aggregation.
4. Sqoop Job:
○ A Sqoop Job is a logical unit of work in Sqoop, consisting of the mappers
and the necessary configurations for transferring data between the
database and Hadoop.
○ The job handles the data transfer process and can be scheduled to run at
specific intervals using Hadoop schedulers like Oozie.
5. HDFS / Hive / HBase:
○ Sqoop imports data from a relational database directly into HDFS, where it
can be stored for further processing or analysis.
○ In addition to HDFS, Sqoop can also import data into Hive tables (for
query processing) or into HBase (for NoSQL processing).
○ For export operations, Sqoop can take processed data from HDFS and
export it back into the relational database.
6. Relational Database (RDBMS):
○ The source or destination for the data in Sqoop's architecture is typically a
relational database, such as MySQL, Oracle, SQL Server, PostgreSQL, or
others.
○ Sqoop interacts with these databases to either import data into Hadoop or
export it from Hadoop back to the RDBMS.
Ans:
Sqoop Import :
● The procedure is carried out with the aid of the sqoop import command.
● We can import a table from the Relational database management system to the
Hadoop database server with the aid of the import command.
● Each imported record is stored as a single line in a text file in HDFS (Sqoop can
also write binary formats such as Avro and SequenceFile).
● While importing data, we can also load it directly into Hive tables and split the work
across parallel mappers.
● Sqoop also enables incremental import: if a table has already been imported and a
few more rows have since been added, only the new rows need to be imported rather
than the entire table. (A hedged example of an import command is sketched below.)
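The import described above is normally driven from the command line; the sketch below simply assembles and launches one such sqoop import command from Java. It assumes the sqoop binary is on the PATH, and the MySQL connection string, credentials, table, and column names are made up for the example.

import java.io.IOException;

public class SqoopImportExample {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Incremental import of new rows from a MySQL table into HDFS.
        // Connection string, credentials, table and column names are assumptions.
        ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost:3306/sales",
                "--username", "report", "--password", "secret",
                "--table", "orders",
                "--target-dir", "/data/orders",
                "--num-mappers", "4",                // 4 parallel map tasks
                "--incremental", "append",           // only import new rows
                "--check-column", "order_id",
                "--last-value", "100000");
        pb.inheritIO();
        int exit = pb.start().waitFor();
        System.out.println("sqoop import finished with exit code " + exit);
    }
}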
Sqoop Export:
● The Sqoop export command performs the reverse operation of import.
● Here, we can transfer data from the Hadoop Distributed File System (HDFS) back to
the relational database management system with the aid of the export command.
● The data to be exported is read from HDFS files and converted into records before it
is written to the target table.
● Two processes are involved in exporting data: the first is introspecting the
database for metadata, and the second is transferring the data. (A hedged example of
an export command is sketched below.)
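Correspondingly, here is a hedged sketch of an export run that pushes processed files from HDFS back into an existing relational table; paths, table names, and credentials are again assumptions for the example.

import java.io.IOException;

public class SqoopExportExample {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Export comma-separated files from HDFS into an existing RDBMS table.
        ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "export",
                "--connect", "jdbc:mysql://dbhost:3306/sales",
                "--username", "report", "--password", "secret",
                "--table", "orders_summary",
                "--export-dir", "/data/orders_summary",
                "--input-fields-terminated-by", ",");
        pb.inheritIO();
        int exit = pb.start().waitFor();
        System.out.println("sqoop export finished with exit code " + exit);
    }
}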
Ans:
● HiveQL (Hive Query Language) is a SQL-like language used for querying and
managing large datasets in Hadoop. HiveQL is designed to be familiar to users
with SQL experience.
● It allows users to perform SQL operations such as SELECT, JOIN, GROUP BY,
ORDER BY, etc., making it easy for business analysts to query big data without
needing deep programming knowledge.
3. Schema on Read:
● Hive applies the schema when data is read (schema-on-read) rather than when it is
loaded, so raw files can simply be placed in HDFS and interpreted later. This makes
loading fast and flexible compared with the schema-on-write approach of traditional
databases.
4. Scalability:
● Hive is highly scalable and can handle petabytes of data. It leverages the
Hadoop MapReduce framework for parallel processing of large datasets across
multiple nodes in a cluster.
● Hive can also use other execution engines like Tez and Apache Spark for faster
query execution.
5. Partitioning and Bucketing:
● Partitioning helps improve query performance by dividing the table data into
smaller pieces based on certain column values (e.g., date, region). Queries can
skip entire partitions if they are not relevant, reducing the data to scan.
● Bucketing further subdivides the partitions into more manageable parts (called
buckets), allowing for faster joins and query processing by evenly distributing
data across the buckets. (A sketch of a partitioned, bucketed table is given below.)
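As a rough illustration of partitioning and bucketing, the hedged sketch below creates such a table through Hive's JDBC (HiveServer2) interface. The connection URL, credentials, table, and column names are assumptions, and the hive-jdbc driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HivePartitionBucketExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 usually listens on port 10000; URL and credentials are assumed.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {

            // Partition by sale_date so queries on a single day scan only that partition;
            // bucket by customer_id to spread rows evenly and speed up joins.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS sales (" +
                "  order_id INT, customer_id INT, amount DOUBLE) " +
                "PARTITIONED BY (sale_date STRING) " +
                "CLUSTERED BY (customer_id) INTO 8 BUCKETS " +
                "STORED AS ORC");

            // A query restricted to one partition only reads that partition's data.
            stmt.execute("SELECT SUM(amount) FROM sales WHERE sale_date = '2024-01-01'");
        }
    }
}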
Ans:
I. Column Types
Column types are used as the column data types of Hive. They are as follows:
● Integral Types
Integer type data can be specified using integral data types, INT. When the data
range exceeds the range of INT, you need to use BIGINT and if the data range is
smaller than the INT, you use SMALLINT. TINYINT is smaller than SMALLINT.
● String Types
String type data types can be specified using single quotes (' ') or double quotes
(" "). It contains two data types: VARCHAR and CHAR. Hive follows C-types
escape characters.
● Timestamp
It supports traditional UNIX timestamp with optional nanosecond precision. It
supports java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” and
format “yyyy-mm-dd hh:mm:ss.ffffffffff”.
● Dates
DATE values are described in year/month/day format in the form YYYY-MM-DD.
● Decimals
The DECIMAL type in Hive is the same as Java's BigDecimal format. It is used
for representing immutable arbitrary-precision numbers.
● Union Types
Union is a collection of heterogeneous data types. You can create an instance
using create_union.
II. Literals
● Floating Point Types
Floating point types are nothing but numbers with decimal points. Generally, this
type of data is composed of DOUBLE data type.
● Decimal Type
Decimal type data is nothing but floating point value with higher range than
DOUBLE data type. The range of decimal type is approximately -10^308 to
10^308.
IV. Complex Types
● Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
● Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
● Structs
Structs in Hive group together fields of possibly different types, each with an
optional comment (similar to a struct in C).
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
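For illustration, here is a hedged sketch (again over the HiveServer2 JDBC interface, with invented table and column names) of a table that uses all three complex types.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveComplexTypesExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS employees (" +
                "  name STRING, " +
                "  skills ARRAY<STRING>, " +               // ARRAY type
                "  phone_numbers MAP<STRING, STRING>, " +  // MAP type
                "  address STRUCT<city:STRING, zip:INT>)");// STRUCT type
        }
    }
}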
Ans:
3. Query Language
● Hive: Uses HiveQL, a SQL-like query language that is more suitable for batch
processing and analytical tasks. While it resembles SQL, it lacks certain
functionalities and optimizations found in traditional SQL.
● SQL Databases: Use standard SQL, which is well-defined and optimized for
transaction processing. SQL is designed for rapid and efficient retrieval of
records, especially for small datasets.
4. Execution Model
● Hive: Primarily runs on MapReduce, which is designed for batch processing and
can be slower for interactive queries. Hive has improved execution with support
for Apache Tez and Apache Spark, but it is still primarily optimized for batch
workloads.
● SQL Databases: Generally use a transaction-oriented execution model that
supports real-time querying and updates. They are optimized for low-latency and
interactive query performance.
5. Data Types
● Hive: Supports complex data types like arrays, maps, and structs, allowing for
more flexible data modeling. It is particularly useful for semi-structured data.
● SQL Databases: Typically focus on structured data with fixed schemas. While
some SQL databases support JSON or XML data types, they often lack the
flexibility of Hive’s complex types.
6. Performance and Workload
● Hive: Designed for batch processing of large datasets rather than real-time
transactions. Although Hive has implemented optimizations such as partitioning
and bucketing, it is not as performant as traditional SQL databases for
transactional workloads.
● SQL Databases: Optimized for performance with features like indexes, caching,
and query optimization, allowing for rapid retrieval and updates of data in
transaction-heavy applications.
Ans:
User Interface (UI):
● The UI is the front-end layer that allows users to interact with Hive. It provides
different interfaces like the CLI (Command Line Interface), Hive Web Interface,
and Thrift Service. Users submit HiveQL queries via these interfaces, which are
then passed to the Driver for processing.
Driver:
● The Driver receives the HiveQL statements from the UI, creates a session handle for
the query, and manages its life cycle. It sends the query to the compiler, collects the
results from the execution engine, and returns them to the user.
Compiler:
● The Compiler takes the parsed query and converts it into a directed acyclic graph
(DAG) of MapReduce, Tez, or Spark jobs.
● It optimizes the logical plan and creates physical execution plans based on
available resources and the underlying execution engine (MapReduce, Tez, or
Spark).
Optimizer:
● The Optimizer rewrites the plan produced by the compiler, applying transformations
such as partition and column pruning, predicate pushdown, and join reordering to
reduce the amount of data processed.
Execution Engine:
● The Execution Engine executes the physical plan generated by the compiler. It
coordinates the distributed execution of jobs across the Hadoop cluster.
● Hive can use multiple execution engines, including:
○ MapReduce (default)
○ Tez (for better performance)
○ Spark (for even faster execution in memory).
Metastore:
● The Metastore stores metadata about Hive tables: their schemas, column names
and data types, partitions, and the HDFS location of the data. It is typically backed by
a relational database such as Derby or MySQL.
HDFS (Storage Layer):
● The Hadoop Distributed File System (HDFS) serves as the storage layer for
Hive. All the actual data resides in HDFS, while Hive provides the interface to
query and manage this data.
● Hive supports a wide variety of data formats, including Text, ORC, Parquet,
Sequence File, and Avro.
Ans:
1. Data Serialization
● Avro provides a compact binary format for data serialization, which means it
efficiently encodes data for storage or transmission. This compactness leads to
reduced disk space usage and improved network performance.
2. Schema-Based
● Avro uses a schema to define the structure of data. The schema is written in
JSON format, allowing users to define fields, data types, and their relationships.
The schema serves as a contract between data producers and consumers,
ensuring compatibility and consistency.
3. Dynamic Typing
● Avro does not require code generation to read or write data: because the schema
travels with the data, generic tools can process Avro data dynamically. Code
generation remains available as an optional optimization for statically typed
languages.
4. Language Independence
● Avro schemas are defined in JSON, and Avro libraries exist for many languages
(Java, Python, C, C++, C#, Ruby, etc.), so data written by one language can be read
by another.
5. Rich Data Structures
● Avro supports complex data types such as arrays, maps, records, and unions.
This allows users to model complex data structures that are common in big data
applications, enabling more expressive and organized data representations.
6. Schema Evolution
● Avro allows for schema evolution, meaning you can change the schema over
time without breaking compatibility with existing data. Avro supports forward and
backward compatibility, enabling producers and consumers to read and write
data using different schema versions.
7. Self-Describing Data
● Avro data files include the schema with the data, making it self-describing. This
feature eliminates the need for external metadata, simplifying data interchange
and making it easier for applications to understand the data structure.
Q17) What are the various data types used in Apache Avro?
Ans:
1. Primitive Data Types
● null − no value
● boolean − a binary value
● int − 32-bit signed integer
● long − 64-bit signed integer
● float − single-precision (32-bit) floating-point number
● double − double-precision (64-bit) floating-point number
● bytes − sequence of 8-bit unsigned bytes
● string − Unicode character sequence
2. Complex Data Types
● record, enum, array, map, union, and fixed − used to build nested and composite
data structures on top of the primitive types.
Ans:
Serialization is the process of converting an object or data structure into a format that
can be easily stored (in files, databases, etc.) or transmitted (over networks). In the
context of Apache Avro, serialization refers to the transformation of data into a binary
format (or JSON format) based on a defined schema.
1. Schema-Based Serialization:
○ Avro uses a schema to define the structure of the data. The schema
specifies the data types and organization, ensuring that data is serialized
consistently.
○ The schema is written in JSON format and acts as a contract between the
data producer and consumer.
2. Compact Binary Format:
○ Serialized Avro data uses a compact binary encoding that does not repeat
field names (the schema supplies them), so it is smaller and faster to read
and write than text formats such as JSON or XML.
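A minimal, hedged sketch of schema-based serialization with the Avro Java library follows; the schema, field names, and file name are invented for the example. The schema is parsed from JSON, records are written to a binary Avro data file that embeds the schema, and then read back without any external metadata.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroSerializationExample {
    public static void main(String[] args) throws Exception {
        // Schema defined in JSON: acts as the contract between producer and consumer.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":[" +
            "{\"name\":\"name\",\"type\":\"string\"}," +
            "{\"name\":\"age\",\"type\":\"int\"}]}");

        // Serialize: write records in Avro's compact binary format; the schema
        // is stored in the file header, making the file self-describing.
        File file = new File("users.avro");
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Deserialize: the reader obtains the schema from the file itself.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec.get("name") + " is " + rec.get("age"));
            }
        }
    }
}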
Ans:
Advantages Of Apache HBase:
● Scalability: HBase can handle extremely large datasets that can be distributed
across a cluster of machines. It is designed to scale horizontally by adding more
nodes to the cluster, which allows it to handle increasingly larger amounts of
data.
● High-performance: HBase is optimized for low-latency, high-throughput access to
data. It uses a distributed architecture that allows it to process large amounts of
data in parallel, which can result in faster query response times.
● Flexible data model: HBase’s column-oriented data model allows for flexible
schema design and supports sparse datasets. This can make it easier to work
with data that has a variable or evolving schema.
● Fault tolerance: HBase is designed to be fault-tolerant by replicating data across
multiple nodes in the cluster. This helps ensure that data is not lost in the event
of a hardware or network failure.
Ans:
The Data Model in HBase is made of different logical components such as Tables,
Rows, Column Families, Columns, Cells and Versions.
Tables – The HBase Tables are logical collections of rows stored in separate
partitions called Regions. Every Region is served by exactly one Region Server.
Rows – A row is one instance of data in a table and is identified by a rowkey. Rowkeys
are unique in a Table and are always treated as a byte[].
Column Families – Data in a row are grouped together as Column Families. Each
Column Family has one or more Columns, and the Columns in a family are stored
together in a low-level storage file known as an HFile. Column Families form the basic
unit of physical storage to which certain HBase features like compression are applied.
Hence it is important that proper care be taken when designing the Column Families of
a table.
Columns – A Column belongs to a Column Family and is identified by a Column
Qualifier. Columns do not have to be defined up front; they can be added on the fly,
and different rows may have different sets of columns.
Cell – A Cell stores data and is essentially a unique combination of rowkey, Column
Family and the Column (Column Qualifier). The data stored in a Cell is called its value
and the data type is always treated as byte[].
Version – The data stored in a cell is versioned and versions of data are identified by
the timestamp. The number of versions of data retained in a column family is
configurable and this value by default is 3.
Ans:
1. Client Request:
○ A client sends a write request to an HBase region server. The write
request usually consists of the row key, column family, column qualifier,
and the value to be written.
2. MemStore:
○ The region server first writes the data to an in-memory structure called the
MemStore. Each column family of a region has its own MemStore.
○ The MemStore is an in-memory write buffer that temporarily holds incoming
writes, which gives fast write performance since data goes to memory rather
than directly to disk. When a MemStore fills up, its contents are flushed to
disk as an HFile.
3. Write-Ahead Log (WAL):
○ Simultaneously, the write request is recorded in the WAL, which is stored
on disk. This ensures durability and allows for recovery in case of failures.
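On the client side, the write path described above corresponds to a simple Put. Here is a hedged sketch with the HBase Java client; the table, column family, qualifier names, and row key are assumptions for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // row key + column family + column qualifier + value
            Put put = new Put(Bytes.toBytes("user#1001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));

            // On the server side this lands in the WAL and the MemStore of the
            // region that owns the row key, and is later flushed to an HFile.
            table.put(put);
        }
    }
}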
The read process in HBase is designed to ensure fast and efficient data retrieval:
1. Client Request:
○ A client sends a read request to a region server specifying the row key for
the desired data.
2. MemStore Check:
○ The region server first checks the MemStore for the requested row. If the
data is found in the MemStore, it is returned directly to the client, providing
a fast response.
3. HFile Lookup:
○ If the data is not found in the MemStore, the region server looks for the
data in the HFiles stored in HDFS.
○ The region server performs a lookup based on the row key in the HFiles. It
reads the HFiles to retrieve the requested data.
4. Bloom Filters:
○ HBase uses Bloom filters to optimize read operations. Bloom filters are
probabilistic data structures that help determine if a row key is likely to
exist in a specific HFile. This helps reduce unnecessary disk reads.
○ If the Bloom filter indicates that the row key is not present in the HFile, the
lookup is skipped, saving time and resources.
5. Return Data:
○ Once the data is found (either in the MemStore or HFiles), it is returned to
the client.
6. Caching:
○ HBase caches frequently accessed data to speed up future read
operations. The caching mechanism stores recently read rows in memory,
allowing for faster access during subsequent requests.
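On the client side the read path is a Get against the row key; here is a hedged sketch using the same assumed table and column names as in the write example above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // The region server checks the MemStore first, then HFiles
            // (using Bloom filters and the block cache) for this row key.
            Get get = new Get(Bytes.toBytes("user#1001"));
            Result result = table.get(get);

            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + (name == null ? "<not found>" : Bytes.toString(name)));
        }
    }
}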
Ans:
HMaster
● HMaster in HBase is the HBase architecture’s implementation of a Master server.
● It serves as a monitoring agent for all Region Server instances in the cluster, as well
as an interface for any metadata changes.
● In a distributed cluster, HMaster typically runs on the master node (commonly
alongside the NameNode).
Region Server
● The Region Server, also known as HRegionServer, is in charge of managing and
providing certain data areas in HBase.
● A region is a portion of a table's data that consists of numerous contiguous rows
ordered by the row key.
● Each Region Server is in charge of one or more regions that the HMaster
dynamically assigns to it.
● The Region Server handles write and read requests sent by clients to HBase,
directing each request to the appropriate region based on the requested row key.
● Clients can connect with the Region Server without requiring HMaster
authorization, allowing for efficient and direct access to HBase data.
Zookeeper
ZooKeeper is a centralized service in HBase that maintains configuration information,
provides distributed synchronization, and provides naming and grouping functions.
Ans:
● ZooKeeper is a distributed, open-source coordination service for distributed
applications.
● It exposes a simple set of primitives that distributed applications can build upon to
implement higher-level services for synchronization, configuration maintenance, and
group and naming services.
Features of ZooKeeper
1. Centralized Coordination:
○ ZooKeeper acts as a centralized service that coordinates distributed
applications. It helps in maintaining the state of different components of
distributed systems and ensures they work together seamlessly.
2. High Availability:
○ ZooKeeper is designed to be highly available and fault-tolerant. It uses a
quorum-based mechanism to ensure that even if some nodes fail, the
service can still function correctly. This redundancy allows ZooKeeper to
maintain availability in case of node failures.
3. Consistency:
○ ZooKeeper provides strong consistency guarantees. All clients see the
same view of the data at any given time, ensuring that updates are atomic
and sequentially consistent.
4. Hierarchical Namespace:
○ ZooKeeper uses a hierarchical tree structure to store its data. This
structure allows clients to organize their data logically, similar to a file
system with directories and files.
5. Watchers:
○ Clients can register "watchers" on specific nodes in ZooKeeper. When the
data or state of a node changes, ZooKeeper notifies the registered clients,
enabling them to react to changes in real-time.
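A minimal, hedged sketch of the watcher mechanism with the ZooKeeper Java client follows; the connection string and znode path are assumptions. The client registers a one-time watch on a znode and is notified when that znode's data changes.

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZkWatchExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch changed = new CountDownLatch(1);

        // Session-level watcher receives connection events and watch notifications.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, (WatchedEvent event) -> {
            if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                System.out.println("znode changed: " + event.getPath());
                changed.countDown();
            }
        });

        // exists(path, true) registers a one-time watch on the znode.
        Stat stat = zk.exists("/config/app", true);
        System.out.println("znode exists: " + (stat != null));

        changed.await();   // blocks until another client updates /config/app
        zk.close();
    }
}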
Ans:
1. Configuration Management
○ ZooKeeper stores shared configuration data in znodes so that all nodes of a
distributed application see a consistent configuration and are notified (via
watches) when it changes.
2. Synchronization
○ Distributed locks and barriers can be built on znodes to coordinate access to
shared resources across processes.
3. Leader Election
○ Candidates create ephemeral sequential znodes under an election path; the
candidate with the smallest sequence number becomes the leader, and a new
leader is elected automatically if it fails (see the sketch below).
4. Group Membership
○ Each member of a group registers an ephemeral znode; when a member
crashes, its znode disappears, so the current membership is always visible.
5. Naming Service
○ The hierarchical namespace can be used to register and look up services or
resources by name.
6. Data Storage
○ ZooKeeper reliably stores small amounts of coordination data (typically a few
kilobytes, at most 1 MB per znode) replicated across the ensemble.
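Leader election (point 3 above) is commonly built from ephemeral sequential znodes; here is a hedged sketch. The connection string and election path are assumptions, and the persistent parent znode /election is assumed to exist already.

import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Each candidate creates an EPHEMERAL_SEQUENTIAL znode under /election.
        // The znode disappears automatically if the candidate's session dies.
        String myNode = zk.create("/election/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // The candidate whose znode has the smallest sequence number is the leader.
        List<String> children = zk.getChildren("/election", false);
        Collections.sort(children);
        boolean leader = myNode.endsWith(children.get(0));
        System.out.println(leader ? "I am the leader" : "Following " + children.get(0));

        zk.close();
    }
}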
Ans:
Q28) What is data analytics? State and explain any 5 analytical tools with their features.
Ans:
Data analytics is the process of examining, cleaning, transforming, and modeling data to
discover useful information, draw conclusions, and support decision-making. It involves
using statistical methods and software tools to analyze data sets, identify patterns,
trends, and relationships, and generate insights that can guide business strategies,
improve operations, and enhance performance.
2. Spark
● Overview: Apache Spark is an open-source unified analytics engine for big data
processing, with built-in modules for streaming, SQL, machine learning, and
graph processing.
● Key Features:
○ In-Memory Processing: Provides faster data processing by storing data
in memory, reducing I/O operations.
○ Ease of Use: Offers APIs in Java, Scala, Python, and R, making it
accessible to a wide range of developers.
○ Versatile: Supports various data sources, including HDFS, Cassandra,
HBase, and S3.
○ Real-Time Processing: Capable of processing data in real-time with
Spark Streaming.
○ Machine Learning: Built-in library (MLlib) for machine learning tasks,
providing algorithms and utilities.
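A small, hedged sketch of Spark's DataFrame API from Java follows, showing the in-memory processing and SQL features mentioned above; the input file path and column names are assumptions for the example.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkQuickLook {
    public static void main(String[] args) {
        // local[*] runs Spark in-process using all cores; on a cluster this
        // would be a YARN / standalone / Kubernetes master URL.
        SparkSession spark = SparkSession.builder()
                .appName("SparkQuickLook")
                .master("local[*]")
                .getOrCreate();

        // Read a JSON file into a DataFrame and query it with Spark SQL.
        Dataset<Row> sales = spark.read().json("sales.json");
        sales.createOrReplaceTempView("sales");
        Dataset<Row> byRegion = spark.sql(
                "SELECT region, SUM(amount) AS total FROM sales GROUP BY region");
        byRegion.show();

        spark.stop();
    }
}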
3. MongoDB
● Overview: MongoDB is an open-source, document-oriented NoSQL database that
stores data in flexible, JSON-like documents.
● Key Features:
○ Schema Flexibility: Supports dynamic schemas, allowing for the storage
of unstructured and semi-structured data.
○ Scalability: Easily scales horizontally through sharding, distributing data
across multiple servers.
○ High Performance: Offers fast read and write operations due to its in-
memory capabilities and indexing options.
○ Rich Query Language: Provides a powerful query language for retrieving
and manipulating data.
○ Aggregation Framework: Supports complex data processing and
transformation operations using the aggregation pipeline.
4. Cassandra
● Overview: Apache Cassandra is an open-source, distributed NoSQL database
designed to handle large volumes of data across many commodity servers with no
single point of failure.
● Key Features:
○ Decentralized Architecture: Every node plays the same role; there is no
master node and no single point of failure.
○ Linear Scalability: Throughput grows as more nodes are added to the cluster.
○ High Availability: Data is replicated across multiple nodes (and optionally
multiple data centers), with tunable consistency levels.
○ CQL: Offers the Cassandra Query Language, a SQL-like language for
defining and querying tables.
Ans:
1. Descriptive Analytics
Overview: Descriptive analytics examines historical data to describe and summarize
what has happened in a business or process.
Key Features:
● Data Summary: Uses measures such as mean, median, mode, and standard
deviation to summarize data.
● Reporting: Generates reports and dashboards to present historical data visually
(e.g., graphs, charts).
● Trend Analysis: Identifies patterns over time, helping businesses understand
how performance metrics have changed.
● Use Cases: Commonly used in business intelligence for sales reports, customer
behavior analysis, and financial performance reviews.
Examples:
● A company analyzing sales data from the past year to determine peak sales
months.
● A healthcare provider reviewing patient data to identify trends in patient
admissions over time.
2. Predictive Analytics
Overview: Predictive analytics uses statistical models and machine learning techniques
to forecast future outcomes based on historical data. It aims to identify trends and
patterns that can help anticipate future events.
Key Features:
● Forecasting: Predicts future values (e.g., sales, demand) from historical trends.
● Modeling Techniques: Uses machine learning and statistical methods such as
regression, decision trees, and neural networks.
● Risk Scoring: Estimates the probability of events such as customer churn, loan
default, or equipment failure.
Examples:
● A retail company using predictive analytics to forecast future sales based on past
purchasing behavior.
● An insurance company predicting the likelihood of claims based on customer
demographics and historical claims data.
3. Prescriptive Analytics
Overview: Prescriptive analytics goes beyond predicting what will happen and
recommends the actions to take to achieve a desired outcome, often using
optimization and simulation techniques.
Key Features:
● Action Recommendation: Suggests the best course of action among several
alternatives.
● Optimization and Simulation: Uses techniques such as linear programming and
what-if simulation to evaluate decisions.
● Decision Support: Often combined with business rules to automate or assist
decision-making.
Examples:
● A logistics company choosing delivery routes that minimize fuel cost and delivery
time.
● An airline adjusting ticket prices dynamically to maximize revenue.