
Name:- Tisha Sachin Shah

Roll no:- 31031523031

Assignment 05
Q1) What is Apache Flume? State and explain the features of Apache Flume.

Ans:-
Apache Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data from various sources to a
centralized data store, like HDFS (Hadoop Distributed File System). Flume is mainly
used for log data collection in Hadoop environments but can also handle streaming
data.

Features of Apache Flume:


● Reliability: Flume supports a reliable data delivery mechanism where data is
buffered in channels and retried if delivery fails.
● Scalability: It is highly scalable, capable of handling large amounts of log data
with multiple agents working simultaneously.
● Fault Tolerance: Flume ensures data recovery in case of node or network
failures.
● Multiple Sources and Destinations: Supports a variety of sources like syslogs,
files, and databases, and destinations like HDFS, HBase, or other sinks.
● Customizable: Users can implement their custom logic for data ingestion via
customizable sources, channels, and sinks.
● Distributed Architecture: Flume agents are distributed across different nodes,
providing a robust system for real-time data ingestion.
● Event-driven Model: Flume works on an event-driven model where each event
(log entry) is treated as a unit of data.

Q2) Describe the architecture of Apache Flume along with a diagram.

Ans:-
Flume Event-
An event is the basic unit of the data transported inside Flume. It contains a payload of bytes that is to be transported from the source to the destination, accompanied by optional headers. A typical Flume event therefore has two parts: an optional set of key-value headers (metadata about the event) and a byte-array body (the payload).
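A minimal sketch of how a client application could build such an event and hand it to a running agent using Flume's Java RPC client (the agent is assumed to expose an Avro source; the host, port, header values, and payload below are placeholders for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeEventSender {
    public static void main(String[] args) throws EventDeliveryException {
        // Connect to an agent whose Avro source listens on this (assumed) host and port.
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            // Optional headers: key-value metadata carried along with the event.
            Map<String, String> headers = new HashMap<>();
            headers.put("hostname", "web-server-01");   // hypothetical header

            // The body is the byte-array payload of the event.
            Event event = EventBuilder.withBody(
                    "sample log line".getBytes(StandardCharsets.UTF_8), headers);

            client.append(event);   // deliver the event to the agent's source
        } finally {
            client.close();
        }
    }
}
```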

Flume Agent-
An agent is an independent daemon process (JVM) in Flume. It receives the data (events) from clients or other agents and forwards it to its next destination (a sink or another agent). Flume may have more than one agent.

A Flume agent contains three main components, namely source, channel, and sink.

Source
● A source is the component of an Agent which receives data from the data
generators and transfers it to one or more channels in the form of Flume events.
● Apache Flume supports several types of sources and each source receives
events from a specified data generator.
● Example − Avro source, Thrift source, twitter 1% source etc.

Channel
● A channel is a transient store which receives the events from the source and
buffers them till they are consumed by sinks.
● It acts as a bridge between the sources and the sinks.
● These channels are fully transactional and they can work with any number of
sources and sinks.
● Example − JDBC channel, File system channel, Memory channel, etc.

Sink
● A sink stores the data into centralized stores like HBase and HDFS.

● It consumes the data (events) from the channels and delivers it to the
destination.
● The destination of the sink might be another agent or the central stores.
● Example − HDFS sink

Q3) Describe the architecture of Apache Pig along with a diagram.

Ans:

The Apache Pig framework is made up of various components. Let us take a look at the major components.

Parser
● Initially the Pig Scripts are handled by the Parser.
● It checks the syntax of the script, does type checking, and other miscellaneous
checks.
● The output of the parser will be a DAG (directed acyclic graph), which represents
the Pig Latin statements and logical operators.
● In the DAG, the logical operators of the script are represented as the nodes and
the data flows are represented as edges.

Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown and filter pushdown.

Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine
● Finally, the MapReduce jobs are submitted to Hadoop in a sorted order.
● These MapReduce jobs are then executed on Hadoop, producing the desired results.

Q4) Explain in detail the working of Apache Flume; also describe flows in Apache Flume.
Or
Describe in detail fan-in and fan-out flows.

Ans:
Working of Apache Flume:

● Flume is a framework which is used to move log data into HDFS.


● Generally events and log data are generated by the log servers and these
servers have Flume agents running on them.
● These agents receive the data from the data generators.
● The data in these agents will be collected by an intermediate node known as
Collector.
● Just like agents, there can be multiple collectors in Flume.
● Finally, the data from all these collectors will be aggregated and pushed to a
centralized store such as HBase or HDFS.
● The overall data flow in Flume is thus: data generators → Flume agents → collectors → centralized store (HDFS or HBase).

Flows in Apache Flume:


Flume can be configured in different flow patterns, depending on how you want to
handle the movement of data. The two common flow patterns are Fan-in and Fan-out.

1. Fan-in Flow:
● In a fan-in flow, multiple sources send data to a single sink through one or more
channels.
● This configuration is useful when you have many sources collecting data from
different locations, and you want to aggregate this data to a central repository.

● Use Case: Aggregating logs from multiple servers and sending them to a single
HDFS directory.

2. Fan-out Flow:
● In a fan-out flow, one source sends data to multiple sinks via different channels.
● This is useful when the same data needs to be sent to multiple destinations, for
example, writing one copy of the data to HDFS for long-term storage and another
to HBase for real-time processing.
● Use Case: Storing logs in both HDFS for batch processing and HBase for real-
time querying.

Q5) State different components of the Apache Flume architecture. Describe the channel selector in detail.

Ans:

Components of Apache Flume Architecture:


Source
● A source is the component of an Agent which receives data from the data
generators and transfers it to one or more channels in the form of Flume events.
● Apache Flume supports several types of sources and each source receives
events from a specified data generator.
● Example − Avro source, Thrift source, twitter 1% source etc.

Channel
● A channel is a transient store which receives the events from the source and
buffers them till they are consumed by sinks.
● It acts as a bridge between the sources and the sinks.
● These channels are fully transactional and they can work with any number of
sources and sinks.
● Example − JDBC channel, File system channel, Memory channel, etc.

Sink
● A sink stores the data into centralized stores like HBase and HDFS.
● It consumes the data (events) from the channels and delivers it to the
destination.
● The destination of the sink might be another agent or the central stores.
● Example − HDFS sink
Channel Selector in Detail:

A Channel Selector is a component within Apache Flume that is responsible for selecting the appropriate channel(s) for routing the events coming from a source. This is especially important when multiple channels are connected to the same source, as the channel selector determines which channel will receive which events.

There are two main types of channel selectors in Flume:

1. Replicating Channel Selector:


● This is the default channel selector in Flume.
● The Replicating Selector sends a copy of each event to all the channels
connected to the source.
● This ensures that all channels receive identical events, which is useful when you
need to send the same data to multiple destinations (e.g., writing data to both
HDFS and HBase simultaneously).

2. Multiplexing Channel Selector:


● The Multiplexing Selector routes events to specific channels based on the value of a configured event header.
● Events whose header value matches a configured mapping are sent to the mapped channel(s), while unmatched events go to a default channel, giving fine-grained control over which events are sent to which channel.

Q6) State and explain key advantages of Apache Flume.

Ans:

1. Distributed and Scalable System:

● Apache Flume is designed as a distributed architecture, meaning it can easily scale horizontally. You can deploy multiple Flume agents across several machines to collect and process data from a wide range of sources.
● It supports handling large volumes of data in a distributed environment, which is
essential when dealing with big data applications like Hadoop.

2. Reliability and Fault Tolerance:

● Flume ensures reliable data transmission between the source and sink by using
channels (like File Channel) that offer durability.
● In case of any failure (such as network issues, sink failure, or node crashes),
Flume can buffer data in channels and retry delivery, ensuring fault tolerance.
● This makes sure that no data is lost during transmission or during failures.

3. Multiple Source and Sink Support:

● Apache Flume supports a wide range of data sources such as log files, Avro,
syslogs, HTTP streams, or custom-built data sources.
● It also supports a wide variety of sinks such as HDFS (Hadoop Distributed File
System), HBase, Elasticsearch, or even custom sinks.
● This flexibility makes it a versatile tool for different use cases and environments.

4. Extensible and Customizable:

● Flume is highly extensible, allowing developers to create custom sources, sinks, and channels to fit specific use cases.
● You can also add interceptors to perform data filtering, enrichment, or
transformation before sending the data to a sink.
● This level of customization makes Flume adaptable to different types of data
pipelines.

5. Event-driven Model:

● Flume operates on an event-driven model, which is highly efficient for real-time data ingestion and processing.
● Each unit of data, called an event, moves through the pipeline from source to
sink, providing continuous data flow without interruption.

6. Data Aggregation:

● Apache Flume can handle data aggregation from various sources into a single
destination (e.g., aggregating logs from multiple servers into a central HDFS
directory).
● It can also handle fan-out scenarios where data from a single source is sent to
multiple sinks.

Q7) Explain the role of Apache Sqoop. Also state key features of Apache Sqoop.

Ans:
● Apache Sqoop is a component of the Hadoop ecosystem.
● There was a need for a specialized tool to perform this movement quickly, because a lot of data had to be moved from relational database systems onto Hadoop.
● This is when Apache Sqoop entered the scene; it is now widely used for moving data from RDBMS tables into the Hadoop ecosystem for MapReduce processing and other uses.
● Data must first be fed into Hadoop clusters from various sources before it can be processed using Hadoop. However, it turned out that loading data from several heterogeneous sources was a challenging task.
● The issues that administrators ran into included:
● Keeping data consistent
● Ensuring effective resource management

● Bulk data loading into Hadoop was not possible.
● Data loading with scripts was sluggish.
● The data stored in external relational databases could not be accessed directly by MapReduce applications.
● Loading data with ad-hoc scripts also risked cluster nodes putting too much load on the external database. Sqoop was the answer.
● The difficulties of the conventional method were completely overcome by using
Sqoop in Hadoop, which also made it simple to load large amounts of data from
RDBMS into Hadoop.
● Most of the procedure is automated by Sqoop, which relies on the database to
specify the data import’s structure.
● Sqoop imports and exports data using the MapReduce architecture, which offers
a parallel approach and fault tolerance.
● Sqoop provides a command line interface to make life easier for developers.

Important Features of Apache Sqoop

● Sqoop uses the YARN framework to import and export data, which provides parallelism as well as fault tolerance.
● We may import the outcomes of a SQL query into HDFS using Sqoop.
● For several RDBMSs, including MySQL and Microsoft SQL servers, Sqoop offers
connectors.
● Sqoop supports the Kerberos computer network authentication protocol, allowing
nodes to authenticate users while securely communicating across an unsafe
network.
● Sqoop can load the full table or specific sections with a single command.

Q8) Explain the architecture of Apache Sqoop along with a diagram.

Ans:

1. Sqoop Client:
○ The Sqoop Client is the command-line interface (CLI) that allows users to
interact with Sqoop.
○ Users submit import/export commands via the Sqoop client, specifying
details such as database connection, tables, HDFS, Hive, or HBase
locations, and other configurations.
○ The client triggers the appropriate Sqoop job based on the user’s input.
2. Connector:
○ Connectors are the components that allow Sqoop to communicate with
different relational databases (such as MySQL, Oracle, PostgreSQL, SQL
Server, etc.).
○ Sqoop uses JDBC or ODBC drivers to connect to databases and execute
SQL queries to read or write data.
○ Some databases may have specialized connectors to optimize
performance for that specific database.
3. Mapper (MapReduce Framework):
○ The core of Sqoop's data transfer mechanism is based on Hadoop’s
MapReduce framework.

○ During an import or export process, Sqoop breaks down the task into
multiple parallel mappers, which allow for faster processing of large
datasets by reading or writing data in chunks.
○ Each mapper reads or writes a portion of the data, allowing Sqoop to
handle very large datasets efficiently.
○ Sqoop does not use the reduce phase in MapReduce since it is mainly
focused on transferring data, not aggregation.
4. Sqoop Job:
○ A Sqoop Job is a logical unit of work in Sqoop, consisting of the mappers
and the necessary configurations for transferring data between the
database and Hadoop.
○ The job handles the data transfer process and can be scheduled to run at
specific intervals using Hadoop schedulers like Oozie.
5. HDFS / Hive / HBase:
○ Sqoop imports data from a relational database directly into HDFS, where it
can be stored for further processing or analysis.
○ In addition to HDFS, Sqoop can also import data into Hive tables (for
query processing) or into HBase (for NoSQL processing).
○ For export operations, Sqoop can take processed data from HDFS and
export it back into the relational database.
6. Relational Database (RDBMS):
○ The source or destination for the data in Sqoop's architecture is typically a
relational database, such as MySQL, Oracle, SQL Server, PostgreSQL, or
others.
○ Sqoop interacts with these databases to either import data into Hadoop or
export it from Hadoop back to the RDBMS.

Q9) State and explain different operations performed in Apache Sqoop.
Or
Explain the role of Sqoop import and export operations.

Ans:
Sqoop Import:

● The import procedure is carried out with the aid of the sqoop import command.
● With the import command, we can import a table from a relational database management system into the Hadoop file system.
● Each row of the table is treated as a single record and is stored as a line in text files in HDFS.
● While importing, data can also be loaded directly into Hive tables, and the import work is split across parallel mappers.
● Sqoop also enables incremental import: if a table has already been imported and a few more rows have since been added, only the new rows need to be imported rather than the entire table.

Sqoop Export:

● The task is carried out with the aid of the sqoop export command, which performs the operation in reverse.
● With the export command, we can transfer data from the Hadoop file system back to a relational database management system.
● Before the operation finishes, the data to be exported is converted into database records.
● The export of data involves two steps: the first is inspecting the database for metadata, and the second is transferring the data.
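To make the two operations concrete, here is a rough sketch of typical import and export invocations. Sqoop is driven from the command line, so the example simply launches the sqoop CLI from Java via ProcessBuilder; the JDBC URL, credentials, table names, and HDFS paths are placeholders, and the sqoop binary is assumed to be on the PATH:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class SqoopJobs {
    // Helper: run a command, inherit stdout/stderr, and wait for it to finish.
    static int run(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Import: copy the "orders" table from MySQL into HDFS using 4 parallel mappers.
        run(Arrays.asList(
            "sqoop", "import",
            "--connect", "jdbc:mysql://dbhost/shop",          // placeholder JDBC URL
            "--username", "etl", "--password-file", "/user/etl/.db-password",
            "--table", "orders",
            "--target-dir", "/data/orders",
            "-m", "4"));

        // Export: push processed results from HDFS back into the "order_summary" table.
        run(Arrays.asList(
            "sqoop", "export",
            "--connect", "jdbc:mysql://dbhost/shop",
            "--username", "etl", "--password-file", "/user/etl/.db-password",
            "--table", "order_summary",
            "--export-dir", "/data/order_summary"));
    }
}
```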

Q10) Distinguish between Sqoop 1 and Sqoop 2.

Ans:

Q11) Differentiate between Pig and Hive.

Ans:

Q12) List and explain common features of Apache Hive.

Ans:

1. SQL-Like Interface (HiveQL):

● HiveQL (Hive Query Language) is a SQL-like language used for querying and
managing large datasets in Hadoop. HiveQL is designed to be familiar to users
with SQL experience.
● It allows users to perform SQL operations such as SELECT, JOIN, GROUP BY,
ORDER BY, etc., making it easy for business analysts to query big data without
needing deep programming knowledge.

2. Data Warehousing Support:

● Hive is essentially a data warehousing solution for Hadoop. It provides a way to organize and query large datasets stored in HDFS (Hadoop Distributed File System).
● It supports partitioning, bucketing, and indexing of tables for optimized query
performance.

3. Schema on Read:

● Hive uses a Schema-on-Read approach, meaning the data schema is applied when the data is read, not when it is written. This allows users to store raw, unstructured data in HDFS, but still be able to apply structured queries over the data.
● This flexibility enables users to query data without needing to transform it during
ingestion.

4. Scalability:

● Hive is highly scalable and can handle petabytes of data. It leverages the
Hadoop MapReduce framework for parallel processing of large datasets across
multiple nodes in a cluster.
● Hive can also use other execution engines like Tez and Apache Spark for faster
query execution.

5. Support for Multiple Data Formats:

● Hive supports a wide range of data formats such as:


○ Text Files
○ Sequence Files
○ ORC (Optimized Row Columnar) format
○ Parquet
○ Avro
○ RCFile
● This flexibility allows Hive to efficiently handle structured, semi-structured, and
unstructured data.

6. Partitioning and Bucketing:

● Partitioning helps in improving query performance by dividing the table data into
smaller pieces based on certain column values (e.g., date, region). Queries can
skip entire partitions if they are not relevant, reducing the data to scan.
● Bucketing further subdivides the partitions into more manageable parts (called
buckets), allowing for faster joins and query processing by evenly distributing
data across the buckets.

7. Extensibility with UDFs (User-Defined Functions):

● Hive allows users to write custom User-Defined Functions (UDFs) in Java to extend the capabilities of HiveQL beyond its built-in functions.
● Users can create their own functions to process data in ways that aren’t natively
supported by Hive.
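As a hedged illustration of this extension point, here is a minimal UDF sketch using the classic org.apache.hadoop.hive.ql.exec.UDF base class; the class name and behaviour are made up for the example:

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A trivial UDF that upper-cases a string column.
public class UpperCaseUDF extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;                        // pass NULL values through unchanged
        }
        return new Text(input.toString().toUpperCase());
    }
}
```

After packaging the class into a JAR, such a function would typically be registered in a Hive session with ADD JAR and CREATE TEMPORARY FUNCTION before being called from HiveQL like any built-in function.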

Q13) What are the different data types used in Hive?

Ans:

I. Column Types
Column type are used as column data types of Hive. They are as follows:

● Integral Types
Integer type data can be specified using integral data types, INT. When the data
range exceeds the range of INT, you need to use BIGINT and if the data range is
smaller than the INT, you use SMALLINT. TINYINT is smaller than SMALLINT.

Hive thus provides four integral types, from smallest to largest: TINYINT, SMALLINT, INT, and BIGINT.

● String Types
String type data can be specified using single quotes (' ') or double quotes (" "). It contains two data types: VARCHAR and CHAR. Hive follows C-style escape characters.

VARCHAR columns take a length specifier between 1 and 65535, while CHAR columns are fixed-length with a maximum of 255 characters.

● Timestamp
It supports traditional UNIX timestamp with optional nanosecond precision. It
supports java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” and
format “yyyy-mm-dd hh:mm:ss.ffffffffff”.

● Dates
DATE values are described in year/month/day format in the form YYYY-MM-DD.

● Decimals
The DECIMAL type in Hive is the same as the BigDecimal format of Java. It is used for representing immutable arbitrary-precision numbers.

● Union Types
Union is a collection of heterogeneous data types. You can create an instance
using create union.

II. Literals
● Floating Point Types
Floating point types are nothing but numbers with decimal points. Generally, this
type of data is composed of DOUBLE data type.

● Decimal Type
Decimal type data is nothing but a floating point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately -10^-308 to 10^308.

III. Null Value


Missing values are represented by the special value NULL.

IV.Complex Types
● Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>

● Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>

● Structs
Structs in Hive group together named fields, each of which can carry an optional comment, similar to structs in C.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>

Q14) Explain how Hive is different from and better than SQL.

Ans:

1. Data Storage and Scalability

● Hive: Built on Hadoop, Hive is designed to handle massive amounts of data, often in petabytes. It can scale horizontally by adding more nodes to the cluster, allowing it to efficiently process large datasets stored in HDFS (Hadoop Distributed File System).
● SQL Databases: Traditional SQL databases are typically designed for vertical
scaling, meaning they require more powerful hardware to manage larger
datasets. While some SQL databases can scale out (e.g., sharding), they often
have limitations compared to Hive’s architecture.

2. Schema-on-Read vs. Schema-on-Write

● Hive: Uses a schema-on-read approach, allowing users to apply schemas to data only when querying it. This flexibility lets users store raw, unstructured data in HDFS and apply structure later, making it easier to manage diverse data types.
● SQL Databases: Generally use a schema-on-write approach, requiring data to
conform to a predefined schema before it can be stored. This rigid structure can
make it difficult to work with unstructured or semi-structured data.

3. Query Language

● Hive: Uses HiveQL, a SQL-like query language that is more suitable for batch
processing and analytical tasks. While it resembles SQL, it lacks certain
functionalities and optimizations found in traditional SQL.
● SQL Databases: Use standard SQL, which is well-defined and optimized for
transaction processing. SQL is designed for rapid and efficient retrieval of
records, especially for small datasets.

4. Execution Model

● Hive: Primarily runs on MapReduce, which is designed for batch processing and
can be slower for interactive queries. Hive has improved execution with support
for Apache Tez and Apache Spark, but it is still primarily optimized for batch
workloads.
● SQL Databases: Generally use a transaction-oriented execution model that
supports real-time querying and updates. They are optimized for low-latency and
interactive query performance.

5. Data Types and Structure

● Hive: Supports complex data types like arrays, maps, and structs, allowing for
more flexible data modeling. It is particularly useful for semi-structured data.
● SQL Databases: Typically focus on structured data with fixed schemas. While
some SQL databases support JSON or XML data types, they often lack the
flexibility of Hive’s complex types.

6. Performance and Optimizations

● Hive: Designed for batch processing of large datasets rather than real-time
transactions. Although Hive has implemented optimizations such as partitioning
and bucketing, it is not as performant as traditional SQL databases for
transactional workloads.
● SQL Databases: Optimized for performance with features like indexes, caching,
and query optimization, allowing for rapid retrieval and updates of data in
transaction-heavy applications.

Q15) Explain the architecture of Hive.

Ans:

User Interface (UI):

● The UI is the front-end layer that allows users to interact with Hive. It provides
different interfaces like the CLI (Command Line Interface), Hive Web Interface,
and Thrift Service. Users submit HiveQL queries via these interfaces, which are
then passed to the Driver for processing.

Driver:

● The Driver is responsible for managing the lifecycle of a HiveQL query. It receives the query from the UI, parses it, and handles the execution flow.
● Key functions of the Driver:

○ Session Management: Manages sessions and keeps track of information like configuration settings for the current session.
○ Parsing and Compilation: It parses the HiveQL query into an abstract syntax tree (AST), checks for errors, and converts it into a logical plan.

Compiler:

● The Compiler takes the parsed query and converts it into a directed acyclic graph
(DAG) of MapReduce, Tez, or Spark jobs.
● It optimizes the logical plan and creates physical execution plans based on
available resources and the underlying execution engine (MapReduce, Tez, or
Spark).

Optimizer:

● The Optimizer applies various optimization techniques to improve query execution. It uses rules like predicate pushdown, column pruning, and join optimization to minimize the amount of data processed and improve performance.

Execution Engine:

● The Execution Engine executes the physical plan generated by the compiler. It
coordinates the distributed execution of jobs across the Hadoop cluster.
● Hive can use multiple execution engines, including:
○ MapReduce (default)
○ Tez (for better performance)
○ Spark (for even faster execution in memory).

Metastore:

● The Metastore is the central repository of metadata in Hive. It stores information about tables, columns, partitions, data types, and the location of data files in HDFS.
● The Metastore helps Hive manage and optimize queries by providing schema information and statistics about tables.

Storage Layer (HDFS):

● The Hadoop Distributed File System (HDFS) serves as the storage layer for
Hive. All the actual data resides in HDFS, while Hive provides the interface to
query and manage this data.
● Hive supports a wide variety of data formats, including Text, ORC, Parquet,
Sequence File, and Avro.
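To make the UI-to-Driver path concrete, here is a small sketch of a client submitting HiveQL to HiveServer2 over JDBC. It assumes HiveServer2 is running and the hive-jdbc driver is on the classpath; the URL, credentials, table, and query are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver; host, port, and database are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // The query is parsed, optimized, and executed by the Driver,
             // Compiler, and Execution Engine described above.
             ResultSet rs = stmt.executeQuery(
                     "SELECT region, COUNT(*) FROM sales GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```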

Q16) State and explain basic features of Apache Avro.

Ans:

1. Data Serialization

● Avro provides a compact binary format for data serialization, which means it
efficiently encodes data for storage or transmission. This compactness leads to
reduced disk space usage and improved network performance.

2. Schema-Based

● Avro uses a schema to define the structure of data. The schema is written in
JSON format, allowing users to define fields, data types, and their relationships.
The schema serves as a contract between data producers and consumers,
ensuring compatibility and consistency.

3. Dynamic Typing

● Unlike some serialization frameworks that require code generation against a fixed schema at compile time, Avro supports dynamic typing. This means you can read and write data without generating code in advance, which allows for easier updates and changes to data structures.

4. Language Independence

● Avro supports multiple programming languages, including Java, Python, C, C++, and more. This makes it easy to use Avro for data exchange between systems written in different languages, promoting interoperability in diverse environments.

5. Rich Data Structures

● Avro supports complex data types such as arrays, maps, records, and unions.
This allows users to model complex data structures that are common in big data
applications, enabling more expressive and organized data representations.

6. Schema Evolution

● Avro allows for schema evolution, meaning you can change the schema over
time without breaking compatibility with existing data. Avro supports forward and
backward compatibility, enabling producers and consumers to read and write
data using different schema versions.

7. Self-Describing Data

● Avro data files include the schema with the data, making it self-describing. This
feature eliminates the need for external metadata, simplifying data interchange
and making it easier for applications to understand the data structure.

Q17) What are the various data types used in Apache Avro?

Ans:
1. Primitive Data Types
Avro defines eight primitive types: null, boolean, int, long, float, double, bytes, and string.

2. Complex Data Types
Avro defines six complex types: record, enum, array, map, union, and fixed.
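A small sketch that combines both kinds of types in one schema and parses it with the Avro Java API; the record name and field names are made up for illustration:

```java
import org.apache.avro.Schema;

public class AvroTypesExample {
    // A schema mixing primitive types (string, int, double) with complex types
    // (array, map, and a union used to make a field nullable).
    private static final String SCHEMA_JSON =
          "{"
        + "  \"type\": \"record\", \"name\": \"Employee\","
        + "  \"fields\": ["
        + "    {\"name\": \"name\",   \"type\": \"string\"},"
        + "    {\"name\": \"age\",    \"type\": \"int\"},"
        + "    {\"name\": \"skills\", \"type\": {\"type\": \"array\", \"items\": \"string\"}},"
        + "    {\"name\": \"scores\", \"type\": {\"type\": \"map\", \"values\": \"double\"}},"
        + "    {\"name\": \"email\",  \"type\": [\"null\", \"string\"], \"default\": null}"
        + "  ]"
        + "}";

    public static void main(String[] args) {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        System.out.println(schema.toString(true));   // pretty-print the parsed schema
    }
}
```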

Q18) Differentiate between primitive and complex data types.

Ans:

Q19) What do you mean by serialization in Apache Avro?

Ans:

Serialization is the process of converting an object or data structure into a format that
can be easily stored (in files, databases, etc.) or transmitted (over networks). In the
context of Apache Avro, serialization refers to the transformation of data into a binary
format (or JSON format) based on a defined schema.

Key Concepts of Serialization in Apache Avro

1. Schema-Based Serialization:
○ Avro uses a schema to define the structure of the data. The schema
specifies the data types and organization, ensuring that data is serialized
consistently.
○ The schema is written in JSON format and acts as a contract between the
data producer and consumer.
2. Compact Binary Format:

○ Avro serializes data in a compact binary format, which makes it more efficient in terms of storage space and network bandwidth compared to text-based formats (like CSV or JSON).
○ This compactness leads to faster read/write operations and reduced disk usage.
3. Self-Describing Data:
○ Avro's serialized data includes the schema along with the data. This
means that when you read the data, the schema is included, making it
self-describing.
○ This feature eliminates the need for external metadata, simplifying data
interchange between different systems.
4. Dynamic Typing:
○ Avro allows for dynamic typing, meaning that data can be serialized and
deserialized without requiring a predefined schema at compile time. This
flexibility is especially useful in evolving data environments.
5. Support for Complex Data Structures:
○ Avro supports various data types, including complex types like arrays,
maps, and records. This allows for rich data serialization and the ability to
model complex data relationships.
6. Schema Evolution:
○ Avro supports schema evolution, meaning that changes can be made to
the schema over time without breaking compatibility with existing data.
This allows for backward and forward compatibility, facilitating easier
updates to data structures.
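A minimal sketch of schema-based serialization to an Avro data file using the generic Java API; the schema, field values, and file name are assumptions for illustration:

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroSerializationDemo {
    public static void main(String[] args) throws Exception {
        // The schema (the "contract" between producer and consumer), defined in JSON.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        // Build a record and serialize it; the schema is embedded in the file,
        // which is what makes Avro data files self-describing.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);

        File file = new File("users.avro");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Deserialize: the reader picks up the schema stored alongside the data.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}
```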

Q20) Write a short note on the advantages and disadvantages of HBase.

Ans:
Advantages Of Apache HBase:

● Scalability: HBase can handle extremely large datasets that can be distributed
across a cluster of machines. It is designed to scale horizontally by adding more
nodes to the cluster, which allows it to handle increasingly larger amounts of
data.
● High-performance: HBase is optimized for low-latency, high-throughput access to
data. It uses a distributed architecture that allows it to process large amounts of
data in parallel, which can result in faster query response times.
● Flexible data model: HBase’s column-oriented data model allows for flexible
schema design and supports sparse datasets. This can make it easier to work
with data that has a variable or evolving schema.
● Fault tolerance: HBase is designed to be fault-tolerant by replicating data across
multiple nodes in the cluster. This helps ensure that data is not lost in the event
of a hardware or network failure.

Disadvantages Of Apache HBase:

● Complexity: HBase can be complex to set up and manage. It requires knowledge of the Hadoop ecosystem and distributed systems concepts, which can be a steep learning curve for some users.
● Limited query language: HBase does not offer a rich query language; the HBase shell and client APIs support only basic operations such as get, put, and scan, which can make complex queries and analyses difficult.
● No support for transactions: HBase does not support multi-row transactions, which can make it difficult to maintain data consistency in some use cases.
● Not suitable for all use cases: HBase is best suited for use cases where high-throughput, low-latency access to large datasets is required. It may not be the best choice for applications that require complex joins, ad-hoc analytics, or multi-row transactional guarantees.

Q21) State and explain the components of the HBase data model.

Ans:
The data model in HBase is made up of different logical components, namely Tables, Rows, Column Families, Columns, Cells and Versions.

Tables – HBase Tables are logical collections of rows stored in separate partitions called Regions, and every Region is served by exactly one Region Server.

Rows – A row is one instance of data in a table and is identified by a rowkey. Rowkeys are unique within a Table and are always treated as a byte[].

Column Families – Data in a row are grouped together as Column Families. Each Column Family has one or more Columns, and the Columns in a family are stored together in a low-level storage file known as an HFile. Column Families form the basic unit of physical storage to which certain HBase features like compression are applied. Hence it is important that proper care be taken when designing the Column Families of a table.

Columns – A Column Family is made of one or more columns. A Column is identified by a Column Qualifier that consists of the Column Family name concatenated with the Column name using a colon – for example: columnfamily:columnname. There can be multiple Columns within a Column Family, and Rows within a table can have a varied number of Columns.

Cell – A Cell stores data and is essentially a unique combination of rowkey, Column
Family and the Column (Column Qualifier). The data stored in a Cell is called its value
and the data type is always treated as byte[].

Version – The data stored in a cell is versioned and versions of data are identified by
the timestamp. The number of versions of data retained in a column family is
configurable and this value by default is 3.
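A minimal sketch of how these components appear in the HBase Java client API, where the row key, column family, and column qualifier together address a cell. The table name, family, and values below are assumptions for illustration, and the table (with its column family) is assumed to already exist:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDataModelDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write a cell: rowkey + column family + column qualifier -> value.
            Put put = new Put(Bytes.toBytes("row-001"));                 // rowkey
            put.addColumn(Bytes.toBytes("info"),                         // column family
                          Bytes.toBytes("name"),                         // column qualifier
                          Bytes.toBytes("Alice"));                       // cell value
            table.put(put);

            // Read the same cell back; HBase keeps multiple timestamped versions of it.
            Get get = new Get(Bytes.toBytes("row-001"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}
```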

Q22) Explain the read and write process of HBase.

Ans:

Write Process in HBase

When data is written to HBase, it follows a specific flow:

1. Client Request:
○ A client sends a write request to an HBase region server. The write
request usually consists of the row key, column family, column qualifier,
and the value to be written.
2. MemStore:
○ The region server applies the data to an in-memory structure called the MemStore. Each column family has its own MemStore.
○ The MemStore is an in-memory buffer that temporarily holds write operations. It allows for fast write performance since it involves writing to memory rather than directly to disk.
3. Write-Ahead Log (WAL):
○ Before the MemStore is updated, the write request is recorded in the WAL, which is stored on disk. This ensures durability and allows for recovery in case of failures.

○ The WAL provides a backup mechanism that can be used to replay operations in the event of a crash.
4. Flush to HFiles:
○ When the MemStore reaches a certain size threshold, it triggers a flush
operation. The data in the MemStore is written to an HFile in the HDFS
(Hadoop Distributed File System).
○ HFiles are the immutable files that store the actual data. This process
allows HBase to manage memory efficiently while ensuring durability.
5. Data Availability:
○ Once the flush is complete, the data becomes available for read
operations. However, because the data is first written to the MemStore, it
may not be immediately visible for read requests until the flush occurs.

Read Process in HBase

The read process in HBase is designed to ensure fast and efficient data retrieval:

1. Client Request:
○ A client sends a read request to a region server specifying the row key for
the desired data.
2. MemStore Check:
○ The region server first checks the MemStore for the requested row. If the
data is found in the MemStore, it is returned directly to the client, providing
a fast response.
3. HFile Lookup:
○ If the data is not found in the MemStore, the region server looks for the
data in the HFiles stored in HDFS.
○ The region server performs a lookup based on the row key in the HFiles. It
reads the HFiles to retrieve the requested data.
4. Bloom Filters:
○ HBase uses Bloom filters to optimize read operations. Bloom filters are
probabilistic data structures that help determine if a row key is likely to
exist in a specific HFile. This helps reduce unnecessary disk reads.
○ If the Bloom filter indicates that the row key is not present in the HFile, the
lookup is skipped, saving time and resources.
5. Return Data:
○ Once the data is found (either in the MemStore or HFiles), it is returned to
the client.
6. Caching:
○ HBase caches frequently accessed data to speed up future read
operations. The caching mechanism stores recently read rows in memory,
allowing for faster access during subsequent requests.

Q23) Describe in detail the HBase architecture.

Ans:

HMaster
● HMaster is the HBase architecture's implementation of a Master server.
● It serves as a monitoring agent for all Region Server instances in the cluster, as well as an interface for any metadata updates.
● In a distributed cluster, HMaster typically runs on the NameNode host.

HMaster plays the following critical roles in HBase:

● It is critical for cluster performance and node maintenance.
● HMaster handles administrative operations, distributes services to region servers, and assigns regions to region servers.
● HMaster includes functions such as load balancing and failover to distribute demand among cluster nodes.
● When a client requests any schema or metadata change, HMaster assumes responsibility for these changes.

Region Server
● The Region Server, also known as HRegionServer, is in charge of managing and serving particular data regions in HBase.
● A region is a portion of a table's data that consists of numerous contiguous rows ordered by the row key.
● Each Region Server is in charge of one or more regions that the HMaster dynamically assigns to it.
● The Region Server serves as an intermediary for clients sending write or read requests to HBase, directing them to the appropriate region based on the requested row key.

● Clients can connect with the Region Server without requiring HMaster
authorization, allowing for efficient and direct access to HBase data.

Zookeeper
ZooKeeper is a centralized service in HBase that maintains configuration information,
provides distributed synchronization, and provides naming and grouping functions.

Q24) Differentiate between HBase and HDFS.

Ans:

Q25) What is ZooKeeper? State and explain the features of ZooKeeper.

Ans:
● Zookeeper is a distributed, open-source coordination service for distributed
applications.
● It exposes a simple set of primitives that can be used to implement higher-level services for synchronization, configuration maintenance, grouping, and naming.

● In a distributed system, there are multiple nodes or machines that need to communicate with each other and coordinate their actions.
● ZooKeeper provides a way to ensure that these nodes are aware of each other
and can coordinate their actions.
● It does this by maintaining a hierarchical tree of data nodes called "znodes", which can be used to store and retrieve data and maintain state information.
● ZooKeeper provides a set of primitives, such as locks, barriers, and queues, that
can be used to coordinate the actions of nodes in a distributed system.
● It also provides features such as leader election, failover, and recovery, which
can help ensure that the system is resilient to failures.
● ZooKeeper is widely used in distributed systems such as Hadoop, Kafka, and
HBase, and it has become an essential component of many distributed
applications.

Features of ZooKeeper

1. Centralized Coordination:
○ ZooKeeper acts as a centralized service that coordinates distributed
applications. It helps in maintaining the state of different components of
distributed systems and ensures they work together seamlessly.
2. High Availability:
○ ZooKeeper is designed to be highly available and fault-tolerant. It uses a
quorum-based mechanism to ensure that even if some nodes fail, the
service can still function correctly. This redundancy allows ZooKeeper to
maintain availability in case of node failures.
3. Consistency:
○ ZooKeeper provides strong consistency guarantees. All clients see the
same view of the data at any given time, ensuring that updates are atomic
and sequentially consistent.
4. Hierarchical Namespace:
○ ZooKeeper uses a hierarchical tree structure to store its data. This
structure allows clients to organize their data logically, similar to a file
system with directories and files.
5. Watchers:
○ Clients can register "watchers" on specific nodes in ZooKeeper. When the
data or state of a node changes, ZooKeeper notifies the registered clients,
enabling them to react to changes in real-time.
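A minimal sketch of these ideas with the ZooKeeper Java client: creating a znode in the hierarchical namespace, reading it back, and registering a watcher for changes. The connection string, znode path, and data are assumptions for illustration:

```java
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (placeholder address) with a 3 s session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000,
                (WatchedEvent event) -> System.out.println("Event: " + event));

        String path = "/app-config";   // a znode in the hierarchical namespace
        if (zk.exists(path, false) == null) {
            zk.create(path, "batch.size=100".getBytes(StandardCharsets.UTF_8),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read the znode and register a watch; the connection watcher above
        // is notified the next time this znode's data changes.
        byte[] data = zk.getData(path, true, null);
        System.out.println("config = " + new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}
```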

Q26) State and explain different services provided by ZooKeeper.

Ans:

1. Configuration Management

● Description: ZooKeeper allows distributed applications to store and manage configuration data centrally.
● Functionality:
○ Applications can read configuration data from ZooKeeper, ensuring all
nodes in the cluster use the same configuration.
○ It supports dynamic configuration updates, enabling applications to react
to changes without requiring restarts.

2. Synchronization

● Description: ZooKeeper provides mechanisms for synchronization among distributed components.
● Functionality:
○ It allows multiple nodes to coordinate and share resources without
conflicts.
○ For example, distributed locking can be implemented using ZooKeeper to
manage access to shared resources, ensuring that only one node can
access a resource at a time.

3. Leader Election

● Description: ZooKeeper facilitates leader election among nodes in a distributed system.
● Functionality:
○ It helps determine which node acts as the leader in scenarios where a
single instance is required for coordination (e.g., master/slave
configurations).
○ By electing a leader, ZooKeeper ensures consistency in operations and
reduces conflicts among nodes.

4. Group Membership

● Description: ZooKeeper maintains information about the state of nodes in a distributed system.
● Functionality:
○ It allows nodes to join and leave groups dynamically, updating the
membership information.
○ Clients can watch for changes in group membership, enabling them to
respond to node failures or new additions in real-time.

5. Naming Service

● Description: ZooKeeper acts as a naming service for distributed applications.


● Functionality:
○ It provides a way for applications to register and discover services using a
hierarchical namespace.

○ Clients can easily find and connect to distributed services by querying ZooKeeper.

6. Data Storage

● Description: ZooKeeper provides a simple data storage mechanism based on a hierarchical key-value structure.
● Functionality:
○ It allows applications to store small amounts of data (e.g., metadata, state
information) that can be accessed quickly.
○ Data stored in ZooKeeper can be read and updated atomically, ensuring
consistency.
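As an example of how several of these services combine, below is a rough sketch of the classic leader-election recipe built on ephemeral sequential znodes. The election path and connection details are assumptions, and production code would additionally watch the predecessor znode and handle session expiry:

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        String electionRoot = "/election";   // assumed to exist already
        // Each candidate creates an ephemeral, sequential znode under the election root.
        // The znode disappears automatically if this candidate's session dies.
        String myNode = zk.create(electionRoot + "/candidate-",
                "node-id".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // The candidate whose znode has the smallest sequence number is the leader.
        List<String> children = zk.getChildren(electionRoot, false);
        Collections.sort(children);
        boolean isLeader = myNode.endsWith(children.get(0));
        System.out.println(isLeader ? "I am the leader" : "I am a follower");

        zk.close();
    }
}
```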

Q27) Distinguish between HBase and RDBMS.

Ans:

Q28) What is data analytics? State and explain any five analytical tools with their features.

Ans:
Data analytics is the process of examining, cleaning, transforming, and modeling data to
discover useful information, draw conclusions, and support decision-making. It involves
using statistical methods and software tools to analyze data sets, identify patterns,
trends, and relationships, and generate insights that can guide business strategies,
improve operations, and enhance performance.

Five Analytical Tools

1. Hadoop

● Overview: Apache Hadoop is an open-source framework designed for distributed storage and processing of large datasets across clusters of computers using simple programming models.
● Key Features:
○ Scalability: Can scale horizontally by adding more nodes to the cluster.
○ Cost-Effectiveness: Uses commodity hardware, making it cheaper to
store and process big data.
○ Distributed Storage: Utilizes Hadoop Distributed File System (HDFS) for
fault-tolerant data storage.
○ Data Processing: Supports batch processing through MapReduce and
integrates with other tools for real-time processing.
○ Ecosystem: Integrates with a wide range of tools like Hive, Pig, and
Spark for advanced analytics and data processing.

2. Spark

● Overview: Apache Spark is an open-source unified analytics engine for big data
processing, with built-in modules for streaming, SQL, machine learning, and
graph processing.
● Key Features:
○ In-Memory Processing: Provides faster data processing by storing data
in memory, reducing I/O operations.
○ Ease of Use: Offers APIs in Java, Scala, Python, and R, making it
accessible to a wide range of developers.
○ Versatile: Supports various data sources, including HDFS, Cassandra,
HBase, and S3.
○ Real-Time Processing: Capable of processing data in real-time with
Spark Streaming.
○ Machine Learning: Built-in library (MLlib) for machine learning tasks,
providing algorithms and utilities.
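To make the Spark entry above concrete, here is a tiny sketch of the kind of batch analysis it describes, using Spark's Java API in local mode; the input file name and column names are placeholders, and spark-sql is assumed to be on the classpath:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SalesByRegion {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SalesByRegion")
                .master("local[*]")            // run locally for the example
                .getOrCreate();

        // Read a CSV file (placeholder path) into a DataFrame, inferring the schema.
        Dataset<Row> sales = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("sales.csv");

        // Aggregate total sales per region and print the result.
        sales.groupBy("region").sum("amount").show();

        spark.stop();
    }
}
```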

3. MongoDB

● Overview: MongoDB is a document-oriented NoSQL database that stores data in a JSON-like format (BSON), providing high flexibility and scalability.

● Key Features:
○ Schema Flexibility: Supports dynamic schemas, allowing for the storage
of unstructured and semi-structured data.
○ Scalability: Easily scales horizontally through sharding, distributing data
across multiple servers.
○ High Performance: Offers fast read and write operations due to its in-
memory capabilities and indexing options.
○ Rich Query Language: Provides a powerful query language for retrieving
and manipulating data.
○ Aggregation Framework: Supports complex data processing and
transformation operations using the aggregation pipeline.

4. Cassandra

● Overview: Apache Cassandra is a distributed NoSQL database designed to handle large amounts of data across many commodity servers while providing high availability with no single point of failure.
● Key Features:
○ High Availability: Offers continuous availability with no downtime and
fault tolerance.
○ Scalability: Scales horizontally by adding nodes to the cluster without
downtime.
○ Distributed Architecture: Utilizes a peer-to-peer architecture, allowing
any node to handle read and write requests.
○ Flexible Data Model: Supports wide-column stores, enabling efficient
storage of large datasets with various structures.
○ Tunable Consistency: Allows developers to configure the level of
consistency required for operations.

5. SAS (Statistical Analysis System)

● Overview: SAS is a software suite used for advanced analytics, business intelligence, data management, and predictive analytics.
● Key Features:
○ Comprehensive Analytics: Provides tools for data mining, statistical
analysis, forecasting, and optimization.
○ User-Friendly Interface: Offers a graphical interface for users, making it
accessible to non-programmers.
○ Robust Data Management: Capable of handling large datasets with
efficient data manipulation and preparation tools.
○ Integration: Easily integrates with other systems and data sources,
including databases, Hadoop, and cloud services.
○ Advanced Statistical Procedures: Offers a wide range of statistical
methods and procedures for in-depth analysis.

Q29) State and explain different types of data analytics (descriptive, predictive, prescriptive).

Ans:

1. Descriptive Analytics

Overview: Descriptive analytics focuses on summarizing historical data to understand what has happened in the past. It provides insights into trends, patterns, and behaviors by analyzing raw data.

Key Features:

● Data Summary: Uses measures such as mean, median, mode, and standard
deviation to summarize data.
● Reporting: Generates reports and dashboards to present historical data visually
(e.g., graphs, charts).
● Trend Analysis: Identifies patterns over time, helping businesses understand
how performance metrics have changed.
● Use Cases: Commonly used in business intelligence for sales reports, customer
behavior analysis, and financial performance reviews.

Examples:

● A company analyzing sales data from the past year to determine peak sales
months.
● A healthcare provider reviewing patient data to identify trends in patient
admissions over time.

2. Predictive Analytics

Overview: Predictive analytics uses statistical models and machine learning techniques
to forecast future outcomes based on historical data. It aims to identify trends and
patterns that can help anticipate future events.

Key Features:

● Statistical Modeling: Utilizes regression analysis, time series analysis, and classification techniques to make predictions.
● Machine Learning: Employs algorithms to learn from historical data and improve
accuracy over time.
● Risk Assessment: Helps organizations identify potential risks and opportunities
by forecasting outcomes.
● Use Cases: Commonly used in finance for credit scoring, in marketing for
customer segmentation, and in healthcare for predicting disease outbreaks.

Examples:

● A retail company using predictive analytics to forecast future sales based on past
purchasing behavior.
● An insurance company predicting the likelihood of claims based on customer
demographics and historical claims data.

3. Prescriptive Analytics

Overview: Prescriptive analytics goes beyond predicting future outcomes; it recommends actions to achieve desired results. It combines data, business rules, and algorithms to suggest the best course of action.

Key Features:

● Optimization Models: Uses mathematical models to identify the best possible solutions for a given problem.
● Simulation: Employs simulation techniques to assess the impact of different
scenarios and decisions.
● Decision Support: Provides actionable recommendations based on predictive
insights and constraints.
● Use Cases: Commonly used in supply chain management, resource allocation,
and strategic planning.

Examples:

● A logistics company using prescriptive analytics to optimize delivery routes and reduce transportation costs.
● A healthcare provider recommending personalized treatment plans based on
patient data and predictive insights.

Q30) Compare the three types of data analytics along with suitable examples.

Ans:
