Bdav QB
Q) Types of Big Data?
Q) Comparison of Big Data with Conventional Data?
Q) Challenges Of Big Data?
Q) Big Data Applications?
Q) What is Hadoop and its Modules?
Q) What is Hadoop Architecture?
The Hadoop architecture is a framework designed for distributed storage and processing of large
data sets across clusters of computers. Its core components include:
1. HDFS (Hadoop Distributed File System):
• It's the primary storage system that stores large data sets across nodes in a Hadoop
cluster.
• HDFS breaks down large files into smaller blocks and replicates them across multiple
DataNodes for fault tolerance.
2. YARN (Yet Another Resource Negotiator):
• It manages resources in the cluster and facilitates job scheduling and task execution
on individual nodes.
3. MapReduce Engine:
• This is the processing model in Hadoop that allows computation of large datasets.
4. Master and Slave Nodes:
• Master nodes typically include NameNode (managing metadata for HDFS) and
ResourceManager (managing resources and job scheduling).
• Slave nodes consist of DataNodes (storing data blocks in HDFS) and NodeManagers
(managing resources and executing tasks in YARN).
5. Hadoop Ecosystem:
• Apart from the core components, the Hadoop ecosystem comprises various tools
and frameworks like Hive, Pig, HBase, Spark, and more.
Advantages of HDFS:
1. Scalability: HDFS can scale horizontally by adding more commodity hardware to the cluster.
It can accommodate petabytes of data by distributing it across multiple nodes.
2. Fault Tolerance: HDFS achieves fault tolerance by replicating data across multiple nodes in
the cluster. If a node fails, the data remains accessible from replicated copies on other nodes.
3. High Throughput: It's optimized for streaming data access rather than random access,
making it well-suited for applications that require high throughput, especially for large files.
4. Cost-Effective: It can run on inexpensive hardware, utilizing the concept of data locality
where computation is performed on the node where data resides, reducing the need for
expensive specialized hardware.
5. Data Replication: Replication of data provides data redundancy and increases data
availability. It ensures data reliability by storing multiple copies of data across different
nodes.
Disadvantages of HDFS:
1. Small File Problem: HDFS is not optimized for handling a large number of small files: each
file, directory, and block is tracked as an object in the NameNode's memory, so many small
files create significant metadata overhead.
2. Latency: HDFS is optimized for batch processing and high throughput rather than low-latency
access. It might not be suitable for applications requiring low-latency access to data.
3. Data Consistency: HDFS sacrifices immediate data consistency for availability and fault
tolerance. In some cases, eventual consistency might lead to issues for applications requiring
strict consistency.
4. Complexity: Configuring and managing an HDFS cluster can be complex and might require
expertise in distributed systems and administration.
5. Single Point of Failure: Although HDFS has mechanisms for fault tolerance, the NameNode
(which manages metadata) can be a single point of failure. However, solutions like HDFS
Federation and High Availability (HA) mitigate this issue.
6. Storage Overhead: The default replication factor in HDFS contributes to storage overhead
because it replicates data across multiple nodes, consuming more storage space.
Q) How Does Hadoop Work?
Q) YARN Architecture?
Q) Application Workflow in Hadoop Yarn?
Q) HDFS Architecture?
Q) Features Of HDFS?
1. Distributed and Parallel Computation:
• HDFS enables distributed and parallel computation by dividing large datasets into
smaller blocks (default block size is 128 MB) and distributing these blocks across
multiple nodes in a Hadoop cluster.
• This division allows for parallel processing, as different nodes can work on different
blocks simultaneously, enhancing performance and speed of data processing.
2. Highly Scalable:
• HDFS scales horizontally (scale-out) by adding more DataNodes to the cluster as data volumes grow.
• This scaling approach avoids the limitations associated with vertical scaling (scale-
up), where adding resources to a single machine has limitations and may require
downtime.
3. Replication:
• Replication is a core feature of HDFS. It ensures fault tolerance and data availability
by creating multiple replicas of each data block and distributing them across
different DataNodes in the cluster.
• The default replication factor is configurable and ensures that if a node containing a
data block fails, the data can be accessed from replicated copies stored on other
nodes.
4. Fault Tolerance:
• Block replication and automatic re-replication of lost blocks allow HDFS to tolerate DataNode failures without losing data.
6. Portability:
• HDFS is written in Java and runs on heterogeneous commodity hardware and operating systems.
7. Cost-Effective:
• HDFS utilizes commodity hardware for DataNodes, which are less expensive
compared to specialized hardware, thereby reducing storage costs.
8. Handling Large Datasets:
• HDFS is capable of storing and handling data of varying sizes, ranging from small files
to petabytes of data. It supports structured, unstructured, and semi-structured data
formats.
9. High Availability:
• HDFS ensures high availability by replicating data blocks across nodes and enabling
failover mechanisms for NameNodes to prevent data unavailability during node
failures.
10. Data Locality:
• Data locality refers to the practice of moving computation to where the data resides,
minimizing data movement across the network and optimizing performance. HDFS
supports data locality by distributing computation closer to the data.
Q) Short note on file system namespace?
The file system namespace refers to the hierarchical structure used by a file system to organize and
manage files and directories. It defines the logical structure and naming conventions for files and
directories within a storage system.
1. Hierarchical Structure: The namespace is organized as a tree that starts at a root directory, with directories and files arranged below it in parent-child relationships.
2. Directories and Files: Directories (also known as folders) can contain both files and other
directories, forming a parent-child relationship. Each directory and file has a unique name
within its parent directory.
3. Naming Convention: The file system namespace provides rules and conventions for naming
files and directories. Filenames are often unique within a directory and can include
characters allowed by the specific file system (such as letters, numbers, symbols).
4. Path Representation: Paths are used to navigate through the namespace and specify the
location of a file or directory within the structure. Absolute paths start from the root
directory, while relative paths are specified from the current working directory.
5. Namespace Operations: File system operations such as creating, deleting, moving, copying,
and accessing files and directories rely on the namespace structure. These operations
maintain the integrity and organization of the namespace.
6. File System Metadata: Each file and directory entry in the namespace is associated with
metadata (attributes) such as permissions, creation date, file size, ownership, and other
properties managed by the file system.
7. Logical View: The namespace provides users and applications with a logical view of the
stored data, abstracting the physical storage details (like disk sectors or blocks) and
presenting a structured hierarchy for easier data management.
Q) Explain vertical and horizontal scaling?
Vertical Scaling (Scale-up):
1. Definition: Vertical scaling, also known as scaling up, involves increasing the resources of an
existing single server or machine to handle more load or improve performance.
2. Method:
• It focuses on adding more CPU power, memory (RAM), storage, or other resources to
the same server or machine.
3. Advantages:
• Can be suitable for applications where a single powerful machine can meet the
demands.
4. Disadvantages:
• Limited scalability: There's a finite limit to how much a single machine can be
upgraded.
• Risk of single point of failure: If the single server fails, it can cause downtime for the
entire system.
Horizontal Scaling (Scale-out):
1. Definition: Horizontal scaling, also known as scaling out, involves increasing the capacity or
performance of a system by adding more machines or nodes to a network or cluster.
2. Method:
• Instead of upgrading a single machine, horizontal scaling distributes the load across
multiple machines, sharing the workload.
3. Advantages:
• Improved scalability: It's easier to add more machines as needed, allowing for
greater scalability.
• Redundancy and fault tolerance: Redundancy across multiple machines reduces the
risk of a single point of failure.
4. Disadvantages:
• Increased complexity in managing a cluster of machines or nodes.
• Potential for higher networking costs and overhead for communication between
nodes.
Q) Describe Rack Awareness.
Rack awareness is a concept within Hadoop's HDFS (Hadoop Distributed File System) that aims to
optimize data storage and fault tolerance by considering the physical network topology of the cluster.
It ensures that data replication occurs across multiple racks within a data center, enhancing both
data reliability and network efficiency.
1. Rack Organization:
• In a large Hadoop cluster, servers (nodes) are organized into racks within a data
center.
2. Data Replication:
• HDFS replicates data blocks across different nodes within the cluster to ensure fault
tolerance.
• Rack awareness ensures that data replicas are stored across different racks rather
than solely within the same rack.
3. Fault Tolerance:
• By spreading data replicas across multiple racks, rack awareness minimizes the risk of
data loss or unavailability in case an entire rack or network switch fails.
• It ensures that copies of the same data block are stored on different racks, reducing
the impact of rack-level failures.
4. Network Efficiency:
• By considering the physical distance and network topology, HDFS tries to store
replicas on different racks, avoiding network traffic congestion that might occur if
multiple replicas are stored on the same rack.
• It promotes efficient data access by allowing data retrieval from the nearest rack,
reducing network latency.
5. Replication Factor:
• The replication factor in HDFS (default is 3) determines how many copies of each
data block are maintained. Rack awareness ensures that these replicas are placed
strategically across racks.
7. Enhanced Reliability:
• Overall, rack awareness significantly enhances the reliability and availability of data
stored in HDFS by ensuring data replication across different racks and mitigating the
risks associated with rack-level failures.
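Rack awareness is typically enabled by pointing Hadoop at a topology script (for example via the net.topology.script.file.name property), which maps node addresses to rack paths. Below is a minimal, illustrative sketch of such a script in Python; the IP-to-rack mapping is invented, and Hadoop is assumed to call the script with DataNode IPs/hostnames as arguments and read one rack path per argument from standard output.

#!/usr/bin/env python3
# Illustrative topology script: maps DataNode addresses to rack paths.
import sys

RACKS = {                      # hypothetical IP-prefix -> rack mapping
    "10.1.1.": "/dc1/rack1",
    "10.1.2.": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"

def resolve(node):
    for prefix, rack in RACKS.items():
        if node.startswith(prefix):
            return rack
    return DEFAULT_RACK

if __name__ == "__main__":
    # One rack path per requested node, in the same order as the arguments.
    print(" ".join(resolve(node) for node in sys.argv[1:]))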
Q) What is Map Reduce?
MapReduce is a programming model and processing paradigm designed to handle and process vast
amounts of data in a distributed computing environment. It's a core component of the Apache
Hadoop framework and serves as a fundamental approach for parallel data processing across a
cluster of computers.
1. Map Phase:
• The input data is divided into smaller chunks and processed in parallel across
multiple nodes in the cluster.
• During the map phase, a user-defined function called the "mapper" processes each
chunk of input data independently.
• The mapper converts the input data into a set of intermediate key-value pairs.
2. Shuffle and Sort Phase:
• The intermediate key-value pairs generated by the mappers are shuffled and sorted
based on their keys.
• This process ensures that all values associated with a particular key are grouped
together, preparing the data for the next phase.
3. Reduce Phase:
• In the reduce phase, another user-defined function called the "reducer" takes the
sorted intermediate key-value pairs as input.
• The reducer processes these key-value pairs, aggregates values associated with the
same key, and produces the final output.
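As a concrete illustration of these phases, the following is a minimal in-process word-count sketch (plain Python simulating the mapper, the shuffle-and-sort step, and the reducer; the sample lines are invented, and a real job would run the same logic distributed across a cluster):

from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word, 1                     # mapper emits (word, 1)

def shuffle_and_sort(pairs):
    groups = defaultdict(list)
    for key, value in pairs:                  # group all values by key
        groups[key].append(value)
    return sorted(groups.items())             # sorted by key, as the framework does

def reduce_phase(grouped):
    for word, values in grouped:
        yield word, sum(values)               # reducer aggregates values per key

lines = ["big data tools", "big data processing"]
for word, count in reduce_phase(shuffle_and_sort(map_phase(lines))):
    print(word, count)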
Key Features of MapReduce:
1. Scalability: MapReduce allows for distributed processing of large datasets across a cluster of
commodity hardware, providing scalability to handle massive volumes of data.
2. Fault Tolerance: It inherently offers fault tolerance by replicating data and rerunning tasks in
case of node failures.
4. Parallel Processing: Data processing occurs in parallel across multiple nodes in the cluster,
leading to faster computation and efficient utilization of resources.
5. Batch Processing: It is well-suited for batch processing tasks where data can be processed in
discrete batches rather than real-time streaming data.
6. Wide Applicability: While initially associated with Hadoop, the MapReduce paradigm has
influenced various other distributed computing frameworks and is widely used in big data
processing.
Q) Explain MapReduce architecture with a neat labelled diagram?
MapReduce Workflow:
1. Job Submission:
• The client application submits a MapReduce job to the JobTracker (in MR1) or
ResourceManager (in YARN) in the Hadoop cluster.
2. Job Initialization:
• The JobTracker/ResourceManager initializes the job and divides the input data into
smaller chunks known as input splits. Each split is assigned to a map task.
3. Map Phase:
• Map tasks are assigned to available TaskTrackers (in MR1) or NodeManagers (in
YARN) in the cluster.
• Each map task executes the user-defined map function on its input split
independently, generating intermediate key-value pairs.
4. Combine and Partition:
• Optionally, a combine function can aggregate and compress the intermediate data
locally on mapper nodes to reduce data transfer during the shuffle phase.
• Partitioning ensures that intermediate data with the same key are sent to the same
reducer. It determines which reducer will process each intermediate key.
5. Shuffle and Sort:
• Intermediate key-value pairs from all mappers are shuffled across the network to the
appropriate reducers based on the partitioning.
• Sort and shuffle involve transferring, sorting, and grouping the intermediate data by
keys, ensuring that all values for a given key are sent to the same reducer.
6. Reduce Phase:
• Reducers start processing the intermediate key-value pairs received from the
mappers.
• For each unique intermediate key, the reducer executes the user-defined reduce
function to aggregate and process the associated values.
• The reducer produces the final output (processed and aggregated data) and writes it
to the HDFS or an output directory.
7. Job Completion:
• Once all map and reduce tasks are completed, the JobTracker/ResourceManager
marks the job as finished.
• The client application can access the final output stored in the specified HDFS
directory or output location.
Applications of MapReduce:
2. Batch Processing:
• Large-scale batch jobs, such as periodic report generation and bulk data transformations, are processed in discrete batches.
3. Log Analysis:
• MapReduce is utilized for log processing and analysis, handling extensive log files
generated by applications, servers, or systems.
4. Search Indexing:
• Search engines utilize MapReduce for web crawling, data extraction, and indexing
large amounts of web content for search queries.
5. ETL and Data Warehousing:
• It's used for data transformation, manipulation, and ETL tasks in data warehousing
and data integration pipelines.
6. Machine Learning and Data Mining:
• MapReduce supports various machine learning algorithms and data mining tasks by
processing and analyzing large datasets to derive insights.
7. Graph Processing:
• It's employed for graph-based algorithms, such as social network analysis, page rank
calculation, and graph traversal.
8. Text Processing and NLP:
• MapReduce is used for text analytics, sentiment analysis, and other NLP tasks on
vast amounts of textual data.
Features of MapReduce:
1. Scalability:
• It scales horizontally across clusters of commodity hardware to process massive volumes of data.
2. Fault Tolerance:
• It inherently provides fault tolerance by replicating data and rerunning tasks in case
of node failures.
3. Parallel Processing:
• Data processing occurs in parallel across multiple nodes, enabling faster computation
and efficient resource utilization.
4. Simplicity:
• Developers write only the map and reduce functions; the framework handles distribution, scheduling, and fault recovery.
5. Wide Applicability:
• It is applicable to a broad range of tasks such as indexing, log analysis, ETL, and machine learning.
6. Integration with the Hadoop Ecosystem:
• It's integrated with the Hadoop ecosystem, allowing seamless interaction with other
Hadoop tools and frameworks.
8. Cost-Effectiveness:
• It runs on inexpensive commodity hardware, lowering the cost of large-scale data processing.
Partitioner:
The Partitioner in MapReduce is responsible for determining which reducer will receive the output of
each map task. It ensures that all values associated with the same key from different mappers are
sent to the same reducer. The main purpose of the Partitioner is to distribute intermediate key-value
pairs across reducers based on the key's hash code.
Key Characteristics:
1. Partitioning Logic:
• By default, Hadoop uses a hash code modulo operation to distribute keys evenly
among reducers.
2. Key Grouping:
• It ensures that all intermediate key-value pairs with the same key, regardless of
which mapper generated them, are sent to the same reducer.
• This co-location of keys helps reducers process data efficiently, aggregating values
associated with the same key in a single location.
3. Custom Partitioning:
• Users can implement custom Partitioners to control how keys are partitioned across
reducers.
• Custom Partitioners enable more specific ways of distributing keys, which might be
beneficial based on the data distribution and reduce tasks.
4. Number of Partitions:
• The number of partitions equals the number of reduce tasks configured for the job; each partition is processed by exactly one reducer.
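As a rough sketch of the partitioning idea, the snippet below mirrors the default hash-based behaviour and a simple custom scheme in plain Python (the keys, the country-to-reducer table, and the reducer count are illustrative, not Hadoop API code):

def default_partition(key, num_reducers):
    # Hash-based partitioning: the same key always goes to the same reducer.
    return hash(key) % num_reducers

COUNTRY_TO_REDUCER = {"IN": 0, "US": 1}       # hypothetical custom routing table

def custom_partition(key, num_reducers):
    # Custom partitioner: route known keys explicitly, hash the rest.
    return COUNTRY_TO_REDUCER.get(key, hash(key) % num_reducers)

for k in ["IN", "US", "UK", "IN"]:
    print(k, "-> reducer", custom_partition(k, 3))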
Combiner:
The Combiner in MapReduce is an optional component that performs a local aggregation of the
intermediate key-value pairs on the mapper nodes before sending them to the reducers. It is similar
to a mini-reducer and helps reduce the volume of data shuffled across the network during the
MapReduce job.
Key Characteristics:
1. Local Aggregation:
• The Combiner operates on the output of the map phase before the shuffle and sort
phase.
• It aggregates and combines intermediate values associated with the same key,
reducing the data volume that needs to be transferred to reducers.
2. Use Cases:
• The Combiner is especially useful when the output of the map phase generates a
large volume of intermediate data for a specific key.
• It is employed for tasks involving extensive data sharing with common keys, leading
to efficiency improvements in MapReduce jobs.
4. Caution in Functionality:
• The Combiner function must be associative, commutative, and have the same input
and output types as the reducer's reduce function.
• Not all operations can be used as a Combiner due to their properties, and incorrect
usage might yield incorrect results.
Q) Describe Matrix and Vector Multiplication by Map Reduce?
Matrix-Vector Multiplication:
1. Map Phase:
• Each map task receives a portion of the matrix and the vector as input.
• The map function computes partial products for each row of the matrix and the
vector elements that correspond to the columns.
• The output of the map phase is a set of key-value pairs, where the key represents the
index of the resulting vector element, and the value is the partial product of the
corresponding matrix row and vector element.
2. Shuffle and Sort Phase:
• Intermediate key-value pairs generated by map tasks are shuffled and sorted based
on keys (resulting vector index) to prepare for the reduce phase.
3. Reduce Phase:
• Reduce tasks receive intermediate pairs associated with the same resulting vector
index.
• The reduce function aggregates and sums the partial products to compute the final
elements of the resulting vector.
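A minimal in-process sketch of these phases for matrix-vector multiplication is shown below (the small matrix, stored as sparse (row, column, value) records, and the vector are invented for illustration; it assumes the vector is available to every mapper, as described above):

from collections import defaultdict

M = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]   # (i, j, m_ij) records
v = [10.0, 20.0]                                           # input vector

def map_phase(records):
    for i, j, m_ij in records:
        yield i, m_ij * v[j]          # key = result row index, value = partial product

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {i: sum(parts) for i, parts in groups.items()}   # sum partial products per row

print(reduce_phase(shuffle(map_phase(M))))                  # {0: 50.0, 1: 110.0}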
Matrix-Matrix Multiplication:
1. Map Phase:
• Each map task is responsible for computing partial products of a portion of the input
matrices.
• The map function processes rows of one matrix and columns of the other matrix to
calculate partial products.
• Output key-value pairs consist of row-column indices of the resulting matrix element
and the corresponding partial product.
2. Shuffle and Sort Phase:
• Intermediate pairs generated by map tasks are shuffled and sorted based on keys
(resulting matrix indices) to prepare for the reduce phase.
3. Reduce Phase:
• Reduce tasks receive intermediate pairs associated with the same resulting matrix
index.
• The reduce function aggregates and sums the partial products to compute the final
elements of the resulting matrix.
Q) Explain Computing Selection and Projection by Map Reduce?
Selection Operation:
Selection is the process of filtering data based on specific conditions or criteria. In MapReduce,
performing selection involves applying a filter to select only the data that meets certain conditions.
1. Map Phase:
• The map function evaluates the selection condition for each record in the dataset.
• For records that satisfy the condition, the map function emits key-value pairs with a
specified key (e.g., a unique identifier) and the record itself as the value.
2. Shuffle and Sort Phase:
• Intermediate key-value pairs generated by map tasks are shuffled and sorted based
on keys to prepare for the reduce phase.
3. Reduce Phase:
• Reduce tasks receive intermediate pairs associated with the specified key.
• The reduce function collects and outputs only the records that satisfy the selection
condition, discarding others.
Projection Operation:
Projection involves selecting specific attributes or columns from a dataset while discarding others. In
MapReduce, performing projection means extracting and transforming data to retain only the
desired attributes.
1. Map Phase:
• The map function extracts the specified attributes or columns from each record.
• It emits key-value pairs with a key (e.g., a unique identifier) and the extracted
attributes as the value.
2. Shuffle and Sort Phase:
• Intermediate key-value pairs generated by map tasks are shuffled and sorted based
on keys to prepare for the reduce phase.
3. Reduce Phase:
• Reduce tasks receive intermediate pairs associated with the specified key.
• The reduce function collects and outputs the desired attributes or columns from the
records, discarding the rest.
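A minimal sketch of the map-side logic for both operations is shown below (the record layout, column positions, and the age > 30 condition are illustrative; as noted above, the reduce step can simply pass the emitted records through):

records = [                          # illustrative (id, name, age, city) tuples
    ("1", "asha", "31", "pune"),
    ("2", "ravi", "24", "mumbai"),
    ("3", "meera", "45", "pune"),
]

def select_map(record):
    # Selection: emit the record only if it satisfies the condition (age > 30).
    if int(record[2]) > 30:
        yield record[0], record

def project_map(record):
    # Projection: keep only the desired attributes (name, city).
    yield record[0], (record[1], record[3])

for r in records:
    for key, value in select_map(r):
        print("selected:", key, value)
    for key, value in project_map(r):
        print("projected:", key, value)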
Q) Explain Computing Grouping and Aggregation by Map Reduce?
Grouping Operation:
Grouping involves organizing data based on a specific attribute or key, creating groups of records that
share the same attribute value.
1. Map Phase:
• The map function extracts the grouping key from each record and emits key-value
pairs, where the key represents the grouping key, and the value is the entire record.
2. Shuffle and Sort Phase:
• Intermediate key-value pairs generated by map tasks are shuffled across the network
and sorted based on keys (grouping keys).
3. Reduce Phase:
• Reduce tasks receive intermediate pairs associated with the same grouping key.
• The reduce function collects all records that share the same grouping key, forming
groups of records based on the key.
Aggregation Operation:
Aggregation involves performing computations (such as sum, count, average) on the values within
each group formed by the grouping operation.
1. Map Phase:
• The map function emits key-value pairs in which the key is the grouping attribute and the value is the quantity to be aggregated.
2. Combine Phase (optional):
• The Combiner function aggregates values within each group, reducing the volume of
data to be transferred to reducers.
3. Reduce Phase:
• Reduce tasks receive intermediate pairs associated with the same grouping key.
• The reduce function aggregates values within each group by performing the
specified aggregation operation (e.g., sum, count, average) on the values.
Q) Short note on sorting and natural join?
Sorting in MapReduce:
Sorting in MapReduce involves arranging data in a specified order (ascending or descending) based
on certain criteria, typically keys or attributes. It follows a three-step process:
1. Map Phase:
• Each map task processes a portion of the dataset and emits key-value pairs, where
the key represents the attribute used for sorting, and the value is the corresponding
data.
2. Shuffle and Sort Phase:
• Intermediate key-value pairs generated by map tasks are shuffled and sorted based
on keys across the cluster, grouping records with the same key together.
3. Reduce Phase:
• In some cases, a dedicated reduce phase might not be necessary for sorting, as the
shuffle and sort phase effectively orders the data across partitions.
Sorting is essential for tasks such as preparing data for efficient searching, organizing data for
analysis, or enabling subsequent processing steps that require ordered data.
Natural Join in MapReduce:
A natural join combines records from two datasets based on a common attribute, retaining only
those records that have matching values in the specified attribute.
1. Map Phase:
• Each map task processes segments of both datasets, extracting the common
attribute (join key) from each record.
• Key-value pairs are emitted, where the key is the join attribute, and the value is the
entire record.
2. Shuffle and Sort Phase:
• Intermediate key-value pairs from map tasks are shuffled and sorted based on join
attributes across the network.
3. Reduce Phase:
• Reduce tasks receive pairs associated with the same join attribute and merge records
from both datasets sharing the same join attribute, forming the resulting joined
dataset.
Natural joins in MapReduce are used to integrate data from different sources based on shared
attributes, creating a unified dataset.
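A minimal in-process sketch of a reduce-side natural join follows (the two sample datasets and the dept_id join key are invented; each mapper tags records with their source so the reducer can pair them up):

from collections import defaultdict

employees = [("e1", "asha", "d1"), ("e2", "ravi", "d2")]   # illustrative dataset 1
departments = [("d1", "sales"), ("d2", "hr")]              # illustrative dataset 2

def map_phase():
    for emp_id, name, dept_id in employees:
        yield dept_id, ("EMP", (emp_id, name))     # tag each record with its source
    for dept_id, dept_name in departments:
        yield dept_id, ("DEPT", dept_name)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    for dept_id, tagged in groups.items():
        emps = [v for tag, v in tagged if tag == "EMP"]
        depts = [v for tag, v in tagged if tag == "DEPT"]
        for emp in emps:                           # pair matching records from both sides
            for dept_name in depts:
                yield emp + (dept_id, dept_name)

print(list(reduce_phase(shuffle(map_phase()))))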
Q) Explain in brief the various features of NoSQL?
Q) What is CAP theorem? How it is different from ACID?
• Scope:
• The CAP theorem applies to distributed data stores spread across multiple nodes, whereas ACID describes guarantees of transactions within a single database system.
• Trade-offs:
• CAP theorem deals with trade-offs between consistency, availability, and partition
tolerance in distributed systems, indicating that it is impossible to achieve all three
simultaneously in case of network partitions.
• ACID properties ensure the reliability, correctness, and transactional integrity within
a single database, not considering distributed scenarios or trade-offs between
availability and consistency in case of failures.
Q) Difference between Relational and NoSQL databases?
Q) What are the different types of NoSQL datastores? Explain each one in short?
Q) Explain Sharding along with its advantages?
Sharding is a database partitioning technique used to horizontally partition data across multiple
servers or nodes in a distributed system. It involves dividing a large database into smaller, more
manageable subsets called shards, and distributing these shards across different nodes in a cluster.
Each shard contains a subset of the dataset.
1. Data Partitioning:
• The dataset is partitioned based on a chosen sharding key or criteria (e.g., user ID,
geographic location).
• Each shard contains a subset of the data, and no two shards hold the same data.
2. Shard Distribution:
• Each node manages one or more shards, and together they form a distributed
database.
3. Query Routing:
• When a query is made, the system determines the relevant shard(s) based on the
sharding key.
• The query is directed to the specific node(s) holding the required shard(s).
Advantages of Sharding:
1. Scalability:
• Horizontal scalability: Allows databases to scale out by adding more nodes as data
volume increases.
• Enables handling larger datasets and higher throughput by distributing the load
across multiple nodes.
2. Performance Improvement:
• Improved read/write performance: Distributing data across shards reduces the data
volume per node, enhancing read and write operations' efficiency.
• Load balancing: Distributes workload evenly across nodes, preventing hotspots and
bottlenecks.
3. Fault Isolation and Availability:
• Fault isolation: A failure in one shard or node does not affect the entire database,
only impacting the subset of data in that shard.
4. Cost-Efficiency:
• Uses commodity hardware: Sharding allows the use of cheaper, commodity
hardware for individual nodes instead of relying on expensive high-end servers.
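A minimal sketch of hash-based shard routing on a user ID (the shard endpoints and the choice of MD5 are illustrative; production systems often prefer consistent hashing or range-based schemes so that adding shards does not remap most keys):

import hashlib

SHARDS = ["db-node-0:5432", "db-node-1:5432", "db-node-2:5432"]   # hypothetical shard endpoints

def shard_for(user_id):
    # Stable hash of the sharding key decides which shard owns the record.
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

for uid in ["user-101", "user-102", "user-103"]:
    print(uid, "->", shard_for(uid))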
Q) Applications and use cases of NoSQL databases?
1. Web Applications:
• User Profiles and Session Management: Storing user profiles, preferences, session data, and
user-generated content in document-oriented or key-value stores (e.g., MongoDB, Redis).
2. E-commerce:
• Product Catalogs and Orders: Storing product catalogs, inventory, shopping carts, and user sessions in flexible, schema-less stores.
3. Social Media:
• Social Media Platforms: Storing and managing user-generated content, social connections,
interactions, and activity feeds using document-oriented or graph databases.
• Real-time Analytics: Analyzing social network structures and behaviors using graph
databases for insights and trend analysis.
4. Gaming:
• Player Data and Leaderboards: Storing player profiles, game state, leaderboards, and game-
related information using NoSQL databases for high-throughput and low-latency operations.
• In-game Analytics: Tracking and analyzing user behavior, game events, and telemetry data to
improve game mechanics and user experience.
5. Internet of Things (IoT):
• Sensor Data and Telemetry: Handling high volumes of time-series data, sensor readings, and
telemetry data generated by IoT devices using NoSQL databases for efficient storage and
analysis.
6. Big Data and Analytics:
• Data Warehousing: Storing and managing large-scale datasets, logs, and semi-structured
data for analytics and reporting purposes.
• Real-time Analytics: Processing and analyzing streaming data for real-time insights and
decision-making.
7. Financial Services:
• Fraud Detection: Analyzing transactional data and patterns in real-time to detect fraudulent
activities and anomalies.
• Risk Management: Handling large volumes of financial data, market data, and risk analytics
for better risk assessment and management.
8. Healthcare:
• Electronic Health Records (EHR): Storing patient data, medical records, and imaging data in
structured or unstructured formats for easy access and analysis.
• Healthcare IoT: Managing data from connected medical devices, wearables, and healthcare
IoT devices.
Q) Explain Shared Nothing architecture in detail?
Shared nothing architecture is a distributed computing architecture in which each node is independent and self-sufficient, sharing neither memory nor disk storage with other nodes; nodes coordinate only by exchanging messages over the network.
Key Characteristics:
1. Independent Nodes:
• Each node operates as an independent unit, with its own CPU, memory, and storage
resources.
• Nodes do not share memory or storage resources directly with other nodes.
2. No Shared Resources:
• There is no shared memory or shared disk between nodes; all coordination happens through message passing over the network.
3. Data Partitioning:
• Data is partitioned or sharded across nodes, with each node responsible for
managing a subset of the overall data.
• Each node holds only a portion of the dataset, and data is distributed across nodes
based on a chosen partitioning scheme (e.g., hashing, range-based partitioning).
4. Scalability:
• Horizontal scalability: New nodes can be added to the system to increase processing
power and storage capacity.
5. Fault Tolerance:
• Improved fault tolerance: Failure of one node does not affect the overall system, as
other nodes continue to function independently.
• Redundancy and replication of data across nodes ensure data availability and
resilience against node failures.
6. Parallel Processing:
• Queries and workloads are executed in parallel across nodes, with each node working on its own data partition.
Q) Explain the HBase data model?
The major components of the HBase data model include the following:
1. Table:
• Explanation:
• An HBase table is a collection of rows grouped into column families; tables are sparse, distributed, and sorted by row key.
2. Row:
• Explanation:
• Each row has a unique row key that identifies and organizes the data.
• Row keys are sorted lexicographically, and rows with similar row keys are stored
close together in HBase.
3. Column Family:
• Explanation:
• Each column family consists of one or more columns and is defined at the table level.
• Data in a column family is physically stored together in HBase, and each column
family must be defined when the table is created.
4. Column Qualifier:
• Explanation:
• They are combined with the column family name to uniquely identify a particular cell
in HBase.
• Cells are addressed using the combination of row key, column family, and column
qualifier.
5. Cell:
• Explanation:
• A cell represents the intersection of a row, column family, and column qualifier.
• It stores the actual data and a timestamp indicating when the data was written or
updated.
6. Timestamp:
• Explanation:
• Timestamps are associated with each version of data, allowing retrieval of specific
versions based on the timestamp.
7. Region:
• Explanation:
• HBase tables are divided into regions, which are contiguous ranges of rows.
• Regions are the unit of distribution and load balancing in HBase, and they are
managed and served by individual region servers.
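Conceptually, an HBase table behaves like a sorted, multi-dimensional map. The sketch below models that addressing scheme with plain Python dictionaries (the table contents, family and qualifier names, and timestamps are invented; this is not the HBase client API):

# row key -> column family -> column qualifier -> timestamp -> value
table = {
    "row-001": {
        "personal": {
            "name": {1700000200: "asha", 1700000100: "a."},   # two versions of one cell
            "city": {1700000100: "pune"},
        },
        "contact": {
            "email": {1700000100: "asha@example.com"},
        },
    },
}

def get_cell(row, family, qualifier):
    # Return the latest version of a cell (highest timestamp).
    versions = table[row][family][qualifier]
    ts = max(versions)
    return ts, versions[ts]

print(get_cell("row-001", "personal", "name"))   # (1700000200, 'asha')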
Q) Explain Compaction along with its types?
In Apache HBase, compaction is a process used to consolidate and optimize data storage by merging
and compacting smaller data files (HFiles) into larger ones. It helps in improving read and write
performance, reducing storage overhead, and managing disk space efficiently. HBase performs
compaction regularly as part of its internal maintenance tasks.
1. Minor Compaction:
• Explanation:
• Minor compaction merges a small number of smaller HFiles from a single HBase
region into fewer, larger files.
• Purpose:
• Reduces the number of small HFiles a read has to touch, improving read performance with relatively low resource cost.
2. Major Compaction:
• Explanation:
• Major compaction merges all HFiles within an HBase region, regardless of size, into a
single HFile.
• Major compaction releases disk space by removing redundant data and old versions,
and it's more resource-intensive than minor compaction.
• Purpose:
• Reclaims disk space by removing deleted or expired data versions (TTL expired).
• Improves read and write performance by reducing disk seek time and optimizing
storage.
Q) HBase write operation?
Q) HBase Read Mechanism?
The read mechanism in Apache HBase involves retrieving data stored in the database. The process of
reading data in HBase includes several steps:
1. Client Request:
• HBase API: The client uses the HBase API to interact with the HBase cluster for data retrieval.
• HMaster and ZooKeeper: The client contacts ZooKeeper to discover the location of the
RegionServers and the HMaster node.
2. Region Lookup:
• Determining Region:
• HBase determines the target region based on the row key specified in the read
request.
3. Block Cache Check:
• HBase checks its Block Cache (in-memory cache) to see if the requested data is
available in memory (cache).
• If the data is found in the Block Cache, it's directly returned to the client, improving
read performance.
4. HFile Read:
• If the data is not in the Block Cache, HBase reads the relevant HFiles from the
Hadoop Distributed File System (HDFS).
• HBase reads the required HFile blocks (containing the relevant row key ranges) from
HDFS into memory.
5. Data Merging and Filtering:
• MemStore Merge: If there are unflushed MemStore contents, HBase performs an in-
memory merge of MemStore data with the read data from HFiles.
• Filtering: Applies filters if specified in the read request to further refine the data.
6. Response to Client:
• Data Retrieval: HBase compiles the required data from memory, HFiles, and filters, and
sends it back to the client as the response to the read request.
• Acknowledgment: The client receives the requested data or results based on the read
operation.
Q) What is HIVE? Explain its architecture?
Apache Hive is a data warehousing infrastructure built on top of Apache Hadoop for querying and
analyzing large datasets stored in Hadoop's distributed file system (HDFS). It provides a SQL-like
interface called HiveQL to perform queries, making it easier for users familiar with SQL to work with
Hadoop.
1. HiveQL:
• SQL-like Language:
• HiveQL allows users to write queries using SQL-like syntax to analyze and process
data.
2. Metastore:
• Metadata Repository:
• Hive uses a Metastore that stores metadata (table schemas, column types,
partitions) in a relational database (e.g., MySQL, Derby) separate from HDFS.
• Contains information about tables, partitions, columns, storage formats, and their
mappings to HDFS files.
3. Driver:
• The driver receives HiveQL queries, compiles them into MapReduce, Tez, or Spark
jobs, and submits them to the respective execution engine.
4. Execution Engine:
• Executes the compiled jobs on the Hadoop cluster using MapReduce, Tez, or Spark.
5. Storage:
• Table data is stored as files in HDFS (or compatible storage) in formats such as TextFile, ORC, or Parquet.
6. User Interface:
• Interfaces such as the Hive CLI, Beeline, and JDBC/ODBC connections through which users submit HiveQL queries.
Hive Workflow:
1. Query Submission:
• Users query and manipulate data using HiveQL similar to SQL operations (SELECT,
INSERT, UPDATE).
2. Compilation:
• The driver compiles HiveQL queries into MapReduce, Tez, or Spark jobs.
3. Execution:
• The execution engine processes jobs by distributing tasks across the Hadoop cluster.
4. Results Retrieval:
• Results are read back from HDFS and returned to the user through the driver.
Advantages of Hive:
• Provides SQL-like querying capabilities for Hadoop, enabling easy data analysis for users
familiar with SQL.
Warehouse Directory:
• Explanation:
• The Warehouse Directory in Apache Hive refers to the location within Hadoop's
distributed file system (HDFS) where Hive stores its data files.
• It serves as the default base directory where tables and their data are stored.
• Purpose:
• Storage Location: Acts as the default storage location for tables created in Hive
unless specified otherwise during table creation.
• Data Organization: Organizes tables, partitions, and associated data files within
HDFS.
• Configuration:
• The location is set by the hive.metastore.warehouse.dir property (by default /user/hive/warehouse in HDFS).
• Usage:
• Table Storage: When tables are created in Hive, their data files (such as ORC,
Parquet, or text files) are stored within the warehouse directory by default.
• Data Retrieval: Hive queries and operations access data stored in tables located
within this directory.
Metastore:
• Explanation:
• The Metastore in Apache Hive serves as a central metadata repository that stores
schema information, table definitions, column details, and storage location
mappings.
• It maintains metadata information about Hive tables, partitions, columns, and their
corresponding HDFS locations.
• Purpose:
• Schema and Table Definitions: Holds information about table schemas, column
types, data formats, and storage locations.
• Functionality:
• Runs as a service backed by a relational database (e.g., MySQL, Derby) and is consulted by the Hive driver and other tools whenever table metadata is needed.
• Usage:
• Query Optimization: Hive uses metadata stored in the Metastore to optimize query
planning and execution.
• Table Management: Facilitates the creation, alteration, and deletion of tables and
partitions in Hive.
Q) What is Hive Query Language? Explain built-in functions in Hive?
Hive Query Language (HiveQL) is a SQL-like language used to query and manipulate data stored in
Apache Hive. It provides a familiar SQL-like interface for users to interact with Hadoop's distributed
file system (HDFS) through Hive. Some characteristics of HiveQL include:
• SQL-like Syntax: HiveQL syntax resembles SQL, making it accessible to users familiar with
traditional relational databases.
• Data Definition and Manipulation: Supports standard SQL operations for data definition
(DDL), manipulation (DML), querying, and analysis.
For example (the table name is illustrative):
CREATE TABLE employees (
id INT,
name STRING,
age INT
)
STORED AS TEXTFILE;
Hive provides a wide range of built-in functions that users can leverage within HiveQL queries to
perform various transformations, calculations, and data manipulations. These functions are
categorized into different types based on their functionalities:
• Aggregate Functions: Operate on multiple rows and return a single result (e.g., SUM, AVG,
COUNT).
• Date and Time Functions: Handle date and time-related operations (e.g., YEAR, MONTH,
DAY, UNIX_TIMESTAMP).
• Conditional Functions: Implement conditional logic (e.g., CASE, COALESCE, IF, NULLIF).
• Collection Functions: Work with collections like arrays, maps, and structs (e.g.,
ARRAY_CONTAINS, MAP_KEYS, STRUCT).
Q) What is Pig? Explain its architecture in detail?
Apache Pig is a high-level platform used for analyzing large datasets in Hadoop using a language
called Pig Latin. It simplifies the process of writing MapReduce programs by providing a more
intuitive and expressive way to handle data transformations and analysis on Hadoop.
1. Pig Latin Scripts:
• Scripting Language:
• Users write data processing programs in Pig using the Pig Latin scripting language.
• Pig Latin scripts define the data flow, transformations, and operations to be
performed on the dataset.
2. Pig Latin Compiler:
• Compilation Process:
• Pig Latin scripts are processed by the Pig Latin Compiler, which translates them into a
series of MapReduce jobs.
• It generates an execution plan called the Directed Acyclic Graph (DAG) representing
the logical and physical operators for the data processing steps.
3. Execution Modes:
• Local Mode:
• Suitable for development and testing, runs Pig on a single machine using local data.
• MapReduce Mode:
• Runs Pig on a Hadoop cluster, executing the compiled MapReduce jobs over data stored in HDFS.
4. Pig Latin Operators:
• Relational Operators:
• LOAD: Loads data into Pig from various sources (e.g., HDFS, local file system, HBase).
• STORE: Saves the results of processing back to HDFS or other storage systems.
• FILTER, GROUP, JOIN, FOREACH, etc.: Perform data transformations and operations
on datasets similar to SQL-like operations.
5. Execution Engine:
• Execution Framework:
• Manages the execution of Pig scripts and the generation of MapReduce jobs.
• Interfaces with Hadoop MapReduce or other execution engines for distributed data
processing.
6. Hadoop Cluster:
• Underlying Framework:
• Pig runs on top of the Hadoop ecosystem, leveraging HDFS for storage and
MapReduce for distributed computation.
• Utilizes the resources of a Hadoop cluster to execute Pig Latin scripts in a distributed
and parallel manner.
Pig Workflow:
1. Script Writing:
• Users write Pig Latin scripts to describe the data flow and transformations needed.
2. Compilation:
• The Pig Latin Compiler compiles the scripts into an execution plan (DAG).
3. Execution Planning:
• The execution plan outlines the sequence of MapReduce jobs needed for data
processing.
4. Execution:
• Pig runtime environment executes the generated MapReduce jobs on the Hadoop
cluster.
5. Results Retrieval:
• Processed results are stored in HDFS or other storage systems as specified in the
script.
Advantages of Pig:
• Provides a high-level abstraction to write complex data processing tasks more easily.
• Supports a wide range of data sources and integrates with other Hadoop ecosystem tools.
Q) Built in functions of pig?
Eval Functions:
• COUNT_STAR: Counts the total number of records or tuples in a relation, including null
values.
• MAX: Retrieves the maximum value from a set of values, useful in finding the highest value
in a dataset.
• MIN: Retrieves the minimum value from a set of values, useful for finding the lowest value in
a dataset.
• SIZE: Determines the size or length of a bag, tuple, or map, returning the number of
elements.
Load/Store Functions:
• PigStorage(): A default function that loads and stores data using delimiters, such as comma
or tab, allowing users to define their own schema.
• TextLoader: Loads data from text files into Pig, each line of the file becomes a record.
• HBaseStorage: Enables loading and storing data from/to Apache HBase tables, leveraging
HBase data within Pig.
• JsonLoader: Loads JSON-formatted data, converting JSON objects into Pig tuples.
Math Functions:
• COS, SIN, TAN: Mathematical trigonometric functions (cosine, sine, tangent) operating on
numeric inputs.
• RANDOM: Generates a random number within a specified range or with a specific seed
value.
String Functions:
• SUBSTRING: Extracts a substring from a string based on specified starting and ending indices.
• LOWER: Converts a string to lowercase.
DateTime Functions:
• GetDay, GetHour, GetYear: Extract specific components (day, hour, year) from a datetime
value.
Q) What is Apache Kafka? Explain its core components?
1. Topics:
• Kafka stores data in topics, which are streams of records categorized by a specific
name.
• Producers write data to topics, and consumers read from these topics.
2. Partitions:
• Topics can be divided into partitions, allowing data to be distributed across multiple
brokers for scalability and parallel processing.
3. Brokers:
• Brokers are individual Kafka server instances responsible for storing data and serving
client requests.
4. Producers:
• They send records/messages to Kafka topics that are then stored in the assigned
partitions.
5. Consumers:
• Applications or services that subscribe to topics and process the published records.
• Consumers read data from partitions and can process it in real-time or at their own
pace.
6. Offsets:
• Offsets allow consumers to track the messages they have read within a partition.
Kafka Architecture:
1. Topics and Partitions:
• Data is organized into topics, each split into partitions that are distributed across brokers.
2. Brokers:
• Responsibilities: Store and manage the partitions, handle client requests, and replicate data
across brokers.
3. ZooKeeper:
• Coordination Service: Kafka uses ZooKeeper for managing and coordinating brokers, leader
election, and metadata storage.
4. Producers and Consumers:
• Producers: Applications that publish records to Kafka topics.
• Consumers: Applications that subscribe to topics and consume the published data.
5. Offsets:
• Sequential IDs assigned to messages within a partition, allowing consumers to keep
track of their read position.
6. Replication:
• Data Redundancy: Kafka replicates partitions across multiple brokers to ensure fault
tolerance and data durability.
• Leader-Follower Replication: Each partition has one leader and multiple follower replicas.
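A minimal producer/consumer sketch of these concepts, assuming the kafka-python client and a broker reachable at localhost:9092 (the topic name, group id, and message contents are illustrative):

import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish JSON records to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", value={"order_id": 1, "amount": 250.0})
producer.flush()

# Consumer: subscribe to the topic as part of a consumer group and read records.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing-service",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    # Each record carries its topic, partition, and offset alongside the payload.
    print(message.topic, message.partition, message.offset, message.value)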
Q) Workflow of PUB-SUB?
Q) Queue Messaging / Consumer group?
Q) Explain Kafka Cluster Architecture Components?
In Apache Kafka, the cluster architecture comprises several key components that collectively form
the infrastructure for handling distributed event streaming. Here are the core components within a
Kafka cluster:
Kafka Broker:
• Definition: Kafka clusters consist of multiple broker nodes. Each broker is a Kafka server
responsible for handling data storage, processing, and client requests.
• Role: Brokers manage topics, partitions, and message replication within the cluster.
• Responsibilities:
• Receive messages from producers, assign offsets, persist them to disk, and serve fetch requests from consumers.
ZooKeeper:
• Coordination Service: Kafka relies on ZooKeeper for managing and coordinating tasks within
the Kafka cluster.
• Role in Kafka:
• Tracks broker membership, helps with leader election for partitions, and stores cluster metadata.
Topics:
• Logical Channels: Kafka stores data in topics, which represent a category or stream of
records.
• Partitions:
• Topics are divided into partitions, allowing parallel processing and scalability.
• Each partition can be replicated across multiple brokers for fault tolerance.
Producers:
• Data Publishers: Producers are applications or processes responsible for publishing data to
Kafka topics.
• Send Records: They send records/messages to Kafka brokers, specifying the topic to which
data should be written.
• No Direct Interaction with Partitions: Producers don't interact directly with partitions; Kafka
handles partition assignment.
Consumers:
• Data Subscribers: Consumers subscribe to topics and retrieve records/messages from Kafka
partitions.
• Read from Partitions: Consumers read from specific partitions and can maintain their read
position using offsets.
Partitions:
• Division of Topics: Topics are divided into partitions, which are ordered, immutable
sequences of records.
• Parallel Processing: Partitions allow for parallelism in data processing and distribution across
brokers.
Replication:
• Data Redundancy: Kafka replicates topic partitions across multiple brokers for fault tolerance
and data durability.
• Leader-Follower Model: Each partition has one leader and multiple follower replicas to
ensure availability and reliability.
Q) What is apache spark?
Q) Explain working with RDDs in Spark?
Working with Resilient Distributed Datasets (RDDs) in Apache Spark involves several key aspects that
facilitate distributed data processing. Here's an explanation based on the information provided:
1. Creation of RDDs:
• RDDs can be created by parallelizing an existing collection (e.g., sc.parallelize) or by
loading external datasets from HDFS, the local file system, or other storage (e.g., sc.textFile).
2. Transformations on RDDs:
• Map, Filter, FlatMap: RDDs support functional transformations such as map, filter,
and flatMap for data manipulation.
• Reduce, GroupBy, SortBy: Aggregation operations like reduce, groupBy, and sortBy
allow for data aggregation and sorting within RDDs.
3. Actions on RDDs:
• Count, Collect, Reduce: Actions like count, collect, and reduce trigger computations
on RDDs and retrieve results.
• Take, First, SaveAsTextFile: Other actions such as take, first, and saveAsTextFile
allow fetching data or storing RDD contents.
4. RDD Lineage and Lazy Evaluation:
• Transformations are lazy: Spark records the lineage (the chain of transformations) and
computes results only when an action is invoked; the lineage also allows lost partitions to be recomputed.
6. RDD Persistence:
• RDDs can be cached in memory or on disk using persist() or cache(), avoiding recomputation
when the same RDD is reused across multiple actions.
• RDDs facilitate in-memory processing, addressing slow data sharing issues present in
traditional MapReduce frameworks.
• Support for iterative algorithms and interactive queries is enhanced due to RDDs'
capability to store data in memory.
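A minimal PySpark sketch of these RDD operations (the sample data and the derived counts are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Creation: parallelize an in-memory collection (sc.textFile("hdfs://...") works for files).
words = sc.parallelize(["big", "data", "big", "spark", "data", "big"])

# Transformations are lazy; nothing runs until an action is called.
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)
frequent = counts.filter(lambda kv: kv[1] >= 2)

frequent.persist()                 # keep the result in memory for reuse across actions

print(frequent.collect())          # action: e.g. [('big', 3), ('data', 2)]
print(frequent.count())            # action: 2

spark.stop()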
Q) Explain Spark Framework?
1. Spark Core:
• The underlying execution engine that provides task scheduling, memory management,
fault recovery, and the RDD abstraction on which the other modules are built.
2. Spark SQL:
• Module enabling interaction with structured and semi-structured data using SQL
queries or DataFrame APIs, bridging SQL capabilities with Spark's distributed
processing.
3. Spark Streaming:
• Enables scalable, fault-tolerant processing of live data streams, typically by treating the
stream as a sequence of small batches.
4. MLlib:
• A library that houses various machine learning algorithms and utilities for scalable
machine learning tasks, enabling efficient data analysis and model training.
5. GraphX:
• Graph processing library providing functionalities for analyzing and processing graph
data structures, facilitating graph-based computations.
6. SparkR and PySpark:
• APIs for using Spark with R and Python languages, respectively, allowing developers
to leverage Spark capabilities within their preferred language environment.
Key Advantages:
1. Speed:
• In-memory computation makes Spark significantly faster than disk-based MapReduce for many
workloads.
2. Ease of Use:
• Offers high-level abstractions and APIs in multiple languages (Scala, Java, Python, R),
simplifying development and deployment of applications.
3. Unified Framework:
• Combines batch processing, streaming, SQL, machine learning, and graph processing in a
single engine.
4. Scalability:
• Scales efficiently with parallel processing across clusters, making it suitable for
handling large-scale datasets and diverse workloads.
Q) Explain Spark SQL and Data Frames?
Spark SQL is a module in Apache Spark that provides optimized support for working with structured
and semi-structured data. It introduces a higher-level interface compared to the traditional RDD-
based API, allowing users to execute SQL queries, perform DataFrame operations, and access
structured data using familiar SQL syntax. Spark SQL seamlessly integrates relational processing
capabilities with Spark's distributed computing engine.
1. DataFrame:
• A distributed collection of data organized into named columns, conceptually similar to a
table in a relational database.
2. SQL Queries:
• Spark SQL allows the execution of SQL queries against DataFrame structures and
external databases.
Key Features:
1. DataFrames:
• DataFrames are immutable and support various transformations and actions (e.g.,
select, filter, groupBy, join) similar to SQL operations.
2. Schema Inference:
• Spark SQL can automatically infer the schema of structured data files (like JSON, CSV,
Parquet) to create DataFrames without requiring explicit schema definitions.
3. Integration:
• Seamlessly integrates with Spark's MLlib (Machine Learning Library) and GraphX for
machine learning and graph processing tasks.
4. Performance Optimization:
• Spark SQL optimizes queries using Catalyst Optimizer and Tungsten Execution Engine,
enabling efficient query execution and leveraging Spark's distributed computing
capabilities.
5. Connectivity:
• Supports connections to various data sources such as Hive, Avro, Parquet, JDBC,
ORC, JSON, and more.
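A minimal PySpark sketch combining the DataFrame API and SQL queries (the rows, column names, and view name are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Build a DataFrame; spark.read.json/csv/parquet can infer schemas from files instead.
df = spark.createDataFrame(
    [("asha", "sales", 31), ("ravi", "hr", 24), ("meera", "sales", 45)],
    ["name", "dept", "age"],
)

# DataFrame operations (select, filter, groupBy) mirror SQL semantics.
df.filter(df.age > 30).groupBy("dept").count().show()

# The same data can be queried with SQL through a temporary view.
df.createOrReplaceTempView("people")
spark.sql("SELECT dept, AVG(age) AS avg_age FROM people GROUP BY dept").show()

spark.stop()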
Q) Explanation of data visualization?
Data visualization is the graphical representation of data and information using visual elements such
as charts, graphs, maps, and other visual tools. It allows individuals to understand complex datasets,
trends, patterns, and relationships within the data by presenting it in a more accessible and
understandable format.
2. Decision Making:
• Helps stakeholders grasp trends quickly and make data-driven decisions.
3. Communication:
• Makes findings easier to share and explain to both technical and non-technical audiences.
4. Identification of Relationships:
• Reveals patterns, correlations, and outliers that are difficult to spot in raw tables.
5. Storytelling:
• Allows data analysts to tell a compelling story using visuals, engaging stakeholders
and conveying the message effectively.
Types of Data Visualization:
1. Charts and Graphs:
• Line charts, bar graphs, pie charts, histograms, scatter plots, etc., are used to
represent numerical data.
2. Maps:
• Geographic or spatial data represented on maps (e.g., region-shaded or symbol maps).
3. Dashboards:
• Combine multiple visualizations into a single interactive view for monitoring and analysis.
4. Infographics:
• Visual representations that combine text, images, and data for quick comprehension.
5. Heatmaps:
• Use color intensity to show the magnitude of values across two dimensions.
Features of Tableau:
1. Interactive Visualizations:
• Users can interact with charts and dashboards through filters, highlighting, and drill-downs.
3. Drag-and-Drop Interface:
• Lets users build visualizations by dragging and dropping fields, without writing code.
4. Visualization Options:
• Offers various chart types, graphs, maps, and other visual elements to represent
data effectively.
5. Data Blending:
• Allows users to blend and join multiple data sources seamlessly, facilitating
comprehensive analysis.
6. Dashboard Creation:
Advantages of Tableau:
1. Ease of Use:
• An intuitive, largely no-code interface lets non-programmers build charts and dashboards quickly.
2. Interactivity:
• Offers interactive features like filters, drill-downs, and tooltips for in-depth
exploration of data.
4. Scalability:
• Handles large datasets and connects to a wide range of data sources, from spreadsheets to big data platforms.
5. Community and Resources:
• A large user community and extensive resources, including forums, online courses,
and tutorials, support users in learning and problem-solving.