BDA CIA 2 IMP Questions

MODULE 2&3 QUESTION BANK

Two marks

1. Point out Replication and scaling features of MongoDB.

Replication and Scaling Features of MongoDB:


1. Replication: MongoDB uses Replica Sets, where multiple copies of data are
maintained across different servers. This ensures high availability and data
redundancy, allowing automatic failover in case of server failure.
2. Scaling: MongoDB supports Sharding, a horizontal scaling mechanism that
distributes data across multiple servers (shards). This enables the handling of large
datasets and high-throughput operations efficiently.
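As a quick illustration, here is a minimal mongo shell sketch of starting a replica set (the set name "rs0" and the host names are assumptions for illustration only):

rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "node1:27017" },
    { _id: 1, host: "node2:27017" },
    { _id: 2, host: "node3:27017" }
  ]
});
rs.status();   // verify the members and see which node was elected primary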

2. Show how sharding is done in big data

Sharding in Big Data


Sharding is the process of partitioning large datasets across multiple servers to ensure
efficient querying and load balancing. Here’s how it is done:
1. Choose a Shard Key: A field in the document is selected to distribute data (e.g., _id,
date, location).
2. Data Distribution: Data is divided into chunks based on the shard key and
distributed across multiple shards (servers).
3. Query Routing: A mongos router directs queries to the appropriate shard based on
the shard key.
4. Balancing: MongoDB’s balancer ensures an even data distribution across shards.
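A minimal shell sketch of these steps (the database name companyDB, collection orders, and shard key orderDate are illustrative assumptions):

sh.enableSharding("companyDB");                            // allow the database to be sharded
sh.shardCollection("companyDB.orders", { orderDate: 1 });  // distribute chunks by the chosen shard key
sh.status();                                               // inspect how chunks are balanced across shards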

3. Write about Data Locality in MapReduce.

Data Locality in MapReduce


Data Locality in MapReduce refers to the concept of processing data close to where it is
stored to reduce network congestion and improve performance.
• Hadoop sends computation to the node where the data resides instead of moving
large datasets across the network.
• This minimizes I/O overhead and network latency, making the MapReduce job
more efficient.
• The Hadoop Distributed File System (HDFS) ensures that data blocks are distributed
across nodes, allowing task trackers to assign map tasks to nodes storing the
required data.

4. Differentiate between MongoDB and traditional RDBMS in terms of data storage.

5. Generalize the term Record Reader/Writer.

Record Reader/Writer in Big Data


1. Record Reader: In Hadoop’s MapReduce, a Record Reader converts input data from
HDFS blocks into key-value pairs for processing by the Mapper. It reads raw data
and splits it logically rather than physically.
2. Record Writer: After processing, the Record Writer takes the key-value pairs
produced by the Reducer and writes them back to HDFS or another storage system
in the desired format.
6. How to insert a document in Mongo DB?

Inserting a Document in MongoDB


To insert a document into a MongoDB collection, use the insertOne() or insertMany()
method.
Example: Insert a Single Document
db.students.insertOne({
name: "John Doe",
age: 22,
course: "Computer Science"
});
Example: Insert Multiple Documents
db.students.insertMany([
{ name: "Alice", age: 21, course: "Mathematics" },
{ name: "Bob", age: 23, course: "Physics" }
]);

7. How to update data in a document in MongoDB using the update command, with a suitable example?

Updating Data in a MongoDB Document


In MongoDB, the updateOne() and updateMany() methods are used to update documents.
Example: Updating a Single Document

db.students.updateOne(
{ name: "John Doe" }, // Filter condition
{ $set: { age: 23 } } // Update operation
);
Example: Updating Multiple Documents
db.students.updateMany(
{ course: "Computer Science" }, // Filter condition
{ $set: { status: "Graduated" } } // Update operation
);
8. Infer how you can manage compute node failures in Hadoop.

Managing Compute Node Failures in Hadoop


Hadoop handles compute node failures through fault tolerance mechanisms:
1. Task Reassignment: When a node fails, the JobTracker (Hadoop v1) or
ResourceManager (Hadoop v2) reassigns the failed task to another healthy node.
2. Speculative Execution: Hadoop detects slow or failing tasks and runs duplicate tasks
on different nodes to ensure job completion.
3. Data Replication: Hadoop’s HDFS replicates data blocks across multiple nodes
(default 3 replicas) to prevent data loss if a node crashes.
4. Heartbeat Monitoring: Hadoop constantly monitors nodes using heartbeats; if a
node stops responding, it is marked as dead, and its tasks are reassigned.

9. Interpret the phases of Map and Reduce task.

Phases of Map and Reduce Task in Hadoop


1. Map Phase
• Input Splitting: The input data is split into chunks.
• Mapping: The Mapper processes each split and converts it into key-value pairs.
• Partitioning: Data is divided based on keys for efficient processing.
• Shuffling & Sorting: Intermediate key-value pairs are sorted and grouped by key.
2. Reduce Phase
• Grouping: The sorted data is grouped by key.
• Reducing: The Reducer processes grouped data to generate the final output.
• Writing Output: The final result is written back to HDFS.
10. Distinguish between Mapper and Reducer in a MapReduce job.

11. Analyze the role of Combiner in MapReduce programming.

Role of Combiner in MapReduce Programming


A Combiner in MapReduce is a local, mini-reducer that helps optimize performance by
reducing the amount of data transferred between the Map and Reduce phases.
1. Data Aggregation: The Combiner processes the output of the Mapper locally,
aggregating or summarizing data (e.g., summing values, counting occurrences)
before sending it to the Reducer.
2. Reduce Network Load: By applying a reduction operation on data at the Mapper
level, the Combiner reduces the amount of data that needs to be shuffled and
transferred over the network to the Reducer, improving overall efficiency.
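In a Hadoop job driver, the combiner is enabled with a single call, typically reusing a Reducer whose operation is associative and commutative (as in the word-count program later in this bank):

job.setCombinerClass(IntSumReducer.class); // run a local reduce on each Mapper's output before the shuffle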

12. Write a query to create and drop database in Hive.

Hive Query to Create and Drop a Database


Create a Database in Hive
CREATE DATABASE my_database;

Drop a Database in Hive


DROP DATABASE my_database;
13. Apply the concept of partitioning to distribute data in a MapReduce job.

Applying Partitioning to Distribute Data in a MapReduce Job


In MapReduce, partitioning is the process of distributing the output of the Mapper across
multiple Reducers based on the key. The partitioning ensures that all records with the
same key are sent to the same Reducer.
1. Partitioning Key: The partitioner uses the key of the key-value pair produced by the
Mapper to determine which Reducer will process the data.
2. Custom Partitioning: By default, the Hadoop partitioner uses a hash function to
assign keys to Reducers. However, you can implement a custom partitioner to
control data distribution, allowing for better load balancing and optimized
processing.
Example of Custom Partitioner:
public class MyCustomPartitioner extends Partitioner<Text, IntWritable> {
@Override
public int getPartition(Text key, IntWritable value, int numPartitions) {
// Example: Partition based on the first character of the key
return key.toString().charAt(0) % numPartitions;
}
}
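For the custom partitioner to take effect, it must be registered in the job driver; a brief sketch (the reducer count of 4 is an arbitrary illustration):

job.setPartitionerClass(MyCustomPartitioner.class); // route Mapper output through the custom partitioner
job.setNumReduceTasks(4); // numPartitions received by getPartition() equals the number of reducers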
14. Identify the different Hive datatypes and their uses.

Primitive Data Types: These are the most common data types in Hive, used for simple scalar
values, e.g., TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, STRING, BOOLEAN, DATE, and TIMESTAMP.
Complex Data Types: These types can store more structured or nested data, e.g., ARRAY, MAP,
STRUCT, and UNIONTYPE (see the sketch below).
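A short HiveQL sketch showing both categories in one table definition (the table and column names are illustrative only):

CREATE TABLE employee_profile (
  emp_id   INT,                          -- primitive
  emp_name STRING,                       -- primitive
  skills   ARRAY<STRING>,                -- complex: ordered list of values
  phone    MAP<STRING, STRING>,          -- complex: key-value pairs
  address  STRUCT<city:STRING, pin:INT>  -- complex: nested record
);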

15. Relate how to read external data into R from different file formats

Reading External Data into R from Different File Formats


CSV Files
o Function: read.csv()
o Example: data <- read.csv("file.csv")
Excel Files
• Function: readxl::read_excel()
• Example:
library(readxl)
data <- read_excel("file.xlsx")

Text Files (Tab-delimited)


• Function: read.table()
• Example: data <- read.table("file.txt", sep = "\t", header = TRUE)

JSON Files
• Function: jsonlite::fromJSON()
• Example: library(jsonlite)
data <- fromJSON("file.json")
16. Highlight the key differences between MapReduce and Apache Pig.

17. Give the Data types in Hive.

Here are some data types in Hive:


1. TINYINT
2. SMALLINT
3. INT
4. BIGINT
5. FLOAT
6. DOUBLE
7. STRING
8. BOOLEAN
9. DATE
10. TIMESTAMP
18. Evaluate the integration of R with Hadoop for data analysis.

Integration of R with Hadoop for Data Analysis


1. Data Processing at Scale: Hadoop provides a distributed environment for storing
and processing large datasets, while R is powerful for statistical analysis and data
visualization. By integrating the two, R can be used to analyze large datasets stored
in Hadoop's HDFS (Hadoop Distributed File System) efficiently.
2. Using RHadoop: RHadoop is a collection of R packages (like rmr2, rhdfs, rhbase)
that facilitate the integration of R with Hadoop. With RHadoop, R can access and
process data directly from HDFS, run MapReduce jobs, and analyze large-scale data
in a parallel and distributed manner.
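A minimal word-count sketch using the rmr2 package (assuming rmr2 is installed and the HADOOP_CMD/HADOOP_STREAMING environment variables are configured; the sample vector is made up for illustration):

library(rmr2)

# Put a small in-memory vector into HDFS for demonstration
words <- to.dfs(c("apple", "orange", "apple", "banana"))

# Express word count as a MapReduce job written entirely in R
wc <- mapreduce(
  input  = words,
  map    = function(k, v) keyval(v, 1),               # emit (word, 1)
  reduce = function(k, counts) keyval(k, sum(counts)) # sum the counts for each word
)

from.dfs(wc)  # read the (word, count) pairs back into R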

19. Write a program in R language to print prime number from 1 to given number.

# Function to check if a number is prime


is_prime <- function(n) {
  if (n <= 1) return(FALSE)
  if (n <= 3) return(TRUE)   # 2 and 3 are prime; also avoids a descending 2:sqrt(n) range
  for (i in 2:floor(sqrt(n))) {
    if (n %% i == 0) return(FALSE)
  }
  return(TRUE)
}

# Function to print prime numbers up to a given number


print_primes <- function(limit) {
for (i in 2:limit) {
if (is_prime(i)) {
print(i)
}
}
}

# Example: Print prime numbers from 1 to 30


print_primes(30)

20. Mention the Types of Iterative Programming in R.

Types of Iterative Programming in R


1. For Loop
o Used for repeating a block of code a fixed number of times.
o Syntax:
for (i in 1:10) {
  # code to be executed
}

2. While Loop
o Repeats a block of code as long as a specified condition is TRUE.
o Syntax:
while (condition) {
  # code to be executed
}

3. Repeat Loop
o Executes an infinite loop until a break condition is met (a short combined example follows).
o Syntax:
repeat {
  # code to be executed
  if (condition) break
}
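A short runnable illustration of all three loop forms (the values are arbitrary):

# for: sum the numbers 1 to 5
total <- 0
for (i in 1:5) total <- total + i

# while: halve a value until it drops below 1
x <- 10
while (x >= 1) x <- x / 2

# repeat: count iterations until a break condition is met
n <- 0
repeat {
  n <- n + 1
  if (n == 3) break
}
cat(total, x, n, "\n")  # prints: 15 0.625 3
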
1. Explain the update and delete operations in MongoDB Query Language.

Update and Delete Operations in MongoDB Query Language (MQL)


MongoDB provides powerful methods to update and delete data in a database. The update
and delete operations are essential for managing and maintaining data integrity. Below is an
explanation of both operations, their syntax, and common use cases.

1. Update Operations in MongoDB


The update operation in MongoDB modifies existing documents in a collection. It allows you to
change values of specific fields, add new fields, or remove existing ones.
Types of Update Operations:
1. updateOne():
o Updates one document that matches the filter criteria.
o If multiple documents match the filter, only the first one is updated.
Syntax:

db.collection.updateOne(
<filter>,
<update>,
{ <options> }
);

Example:
db.employees.updateOne(
{ _id: 1 }, // Filter: match the document with _id 1
{ $set: { salary: 60000 } } // Update: set the salary field to 60000
);

2. updateMany():
o Updates multiple documents that match the filter criteria.
Syntax:
db.collection.updateMany(
<filter>,
<update>,
{ <options> }
);
Example:
db.employees.updateMany(
{ department: "HR" }, // Filter: match all documents where department is "HR"
{ $set: { salary: 50000 } } // Update: set the salary field to 50000 for all matching employees
);
3. replaceOne():
o Replaces the entire document that matches the filter with a new document.
Syntax:
db.collection.replaceOne(
<filter>,
<replacement>,
{ <options> }
);

Example:
db.employees.replaceOne(
{ _id: 1 }, // Filter: match the document with _id 1
{ _id: 1, name: "John Doe", department: "Engineering", salary: 70000 } // Replace the document entirely
);

Update Operators:
MongoDB provides several operators that can be used with update operations (a combined example follows the list):
• $set: Sets the value of a field.
• $inc: Increments the value of a field.
• $push: Adds an element to an array.
• $addToSet: Adds an element to an array only if it doesn't already exist.
• $unset: Removes a field from a document.
• $rename: Renames a field.
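A brief sketch combining several of these operators in a single update on a hypothetical employees document (field names are illustrative):

db.employees.updateOne(
  { name: "John Doe" },
  {
    $inc:      { salary: 5000 },          // increase salary by 5000
    $push:     { projects: "Migration" }, // append an element to the projects array
    $addToSet: { skills: "MongoDB" },     // add only if not already present
    $unset:    { tempFlag: "" },          // remove the tempFlag field
    $rename:   { dept: "department" }     // rename the dept field
  }
);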

2. Delete Operations in MongoDB


The delete operation is used to remove documents from a collection. There are two main
methods for deleting documents in MongoDB.
Types of Delete Operations:
1. deleteOne():
o Deletes one document that matches the filter criteria.
Syntax: db.collection.deleteOne(<filter>);

Example: db.employees.deleteOne({ _id: 1 }); // Deletes the document with _id 1

2. deleteMany():
o Deletes multiple documents that match the filter criteria.
Syntax: db.collection.deleteMany(<filter>);
Example: db.employees.deleteMany({ department: "HR" });

2. Discuss the advantages of using MongoDB's JSON-like document structure for storing employee records over a traditional relational database schema.

Advantages of Using MongoDB's JSON-like Document Structure for Storing Employee Records Over a Traditional Relational Database Schema
MongoDB's JSON-like document structure (BSON - Binary JSON) offers several advantages
over traditional relational database schemas for storing employee records. These benefits
stem from the flexibility, scalability, and ease of use provided by MongoDB’s document-based
approach.
Here are the key advantages:

1. Flexible Schema Design


MongoDB:
• MongoDB uses a schemaless design, meaning each document (record) can have
different fields and structures.
• For employee records, this flexibility allows you to store various data without rigid
constraints. For example, some employee records might have additional information
such as multiple phone numbers or work locations, while others might not.
• This flexibility allows quick adaptation to changing business requirements, such as
adding new fields to employee records without needing to alter the database schema.
Traditional RDBMS:
• Relational databases use a fixed schema, where all rows in a table must conform to the
same structure (columns). Any changes, like adding or removing columns, often require
modifying the table structure, which can be cumbersome and disruptive.
Example: In MongoDB, you can store an employee’s skills as an array of strings, and the
contact info as a nested document, which is much more difficult to achieve in a relational
database.
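For instance, a single employee document might look like the following (field names and values are purely illustrative):

{
  "_id": 101,
  "name": "Jane Smith",
  "skills": ["Python", "MongoDB", "Spark"],
  "contact": {
    "email": "jane@example.com",
    "phones": ["+1-555-0100", "+1-555-0101"]
  },
  "work_history": [
    { "company": "Acme Corp", "role": "Analyst", "years": 2 },
    { "company": "Globex", "role": "Engineer", "years": 3 }
  ]
}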

2. Easier Data Representation (Complex Structures)


MongoDB:
• With its JSON-like structure, MongoDB can natively represent complex and nested
data, which is common for employee records. For example, you can easily represent
nested information such as addresses, work experience, or departments as sub-
documents within an employee record.
• MongoDB’s embedded documents and arrays allow storing employee's related
information (like past projects, previous job roles, and certifications) together within a
single record, leading to simpler data retrieval and better organization.
Traditional RDBMS:
• Representing complex relationships, such as an employee with multiple work locations
or a history of job positions, typically requires multiple related tables and JOIN
operations. This results in complicated queries and often affects performance.
• In relational databases, this requires normalization, which may result in multiple tables
and an increased need for foreign keys and JOINs.

3. Horizontal Scalability and Performance


MongoDB:
• MongoDB is designed to scale horizontally, which means it can handle large volumes of
employee data by distributing it across multiple servers using sharding.
• The document structure and the ability to store data in a denormalized way (reducing the
need for joins) allow for faster reads and writes, especially when dealing with large
datasets like employee records in large organizations.
• MongoDB provides better performance when storing large, unstructured, or semi-
structured data, making it easier to scale as the company grows.
Traditional RDBMS:
• Relational databases typically scale vertically, which means increasing the capacity of a
single server (adding CPU, RAM, storage, etc.). Scaling horizontally (across multiple
servers) requires complex sharding or partitioning, which can be difficult to implement
and manage.
• Also, in relational databases, normalization and JOIN operations can slow down
performance as the dataset grows, especially when handling many relationships
between employee data and other entities.

4. Real-Time Analytics and Aggregation


MongoDB:
• MongoDB’s aggregation framework allows for powerful, real-time analytics directly on
employee records, such as calculating the number of employees in different
departments, average salary by region, or sorting employees by their years of service.
• It enables ad-hoc queries, which means employees’ records can be queried and
analyzed without predefined schema restrictions, allowing for more flexible and faster
decision-making.
Traditional RDBMS:
• In a relational database, similar analytics often require complex JOIN operations and
may need pre-built views or materialized views to optimize performance. The queries are
more rigid due to the fixed schema and may not be as efficient as MongoDB’s
aggregation framework for large datasets.

5. NoSQL Model for Fast Development and Iteration


MongoDB:
• MongoDB’s flexible schema design facilitates faster application development, particularly
when you need to iterate quickly. As business needs change (e.g., new employee
benefits or tracking additional employee metrics), the schema can evolve without the
overhead of modifying the underlying database structure.
• MongoDB’s support for dynamic schemas allows developers to store different kinds of
information with varying structures, making it ideal for fast-paced development
environments, such as tech startups or companies experimenting with new HR systems.
Traditional RDBMS:
• In a relational model, developers must define the schema upfront and make changes
carefully. Altering the database schema for evolving requirements (e.g., adding a new
field for "remote work status") can involve significant overhead, including database
migrations and potential downtime.

6. High Availability with Replica Sets


MongoDB:
• MongoDB supports replica sets, which provide automatic failover and data redundancy.
This means employee records are always available even if one node fails, making it
ideal for mission-critical systems where uptime is important.
• Replica sets ensure that employee data is consistently available for read/write
operations, even during server maintenance or unexpected failures.
Traditional RDBMS:
• Achieving high availability in relational databases typically involves complex
configurations and clustering technologies, such as master-slave replication or
clustering. These configurations may involve manual intervention during failovers or
backup processes.

7. Handling Unstructured Data


MongoDB:
• MongoDB excels in handling unstructured and semi-structured data (such as
employee comments, documents, or logs) alongside structured data. For example, you
can store employee feedback in a text field and use it for analysis without having to fit it
into a strict schema.
Traditional RDBMS:
• Relational databases are not well-suited for unstructured or semi-structured data. Any
unstructured data (like text or images) often requires separate tables or special handling,
adding complexity to the database schema.

8. Reduced Data Duplication and Complexity


MongoDB:
• MongoDB’s denormalized data model reduces the need for complex JOINs. Employee
data, including related data such as department information, can be embedded directly
within the document, thus avoiding the need to reference multiple tables.
Traditional RDBMS:
• In relational databases, data normalization can lead to multiple tables and frequent JOIN
operations. These operations are often more complex and slower, especially when
dealing with large datasets of employee records, where data redundancy is minimized
through normalization.

3. Describe the aggregate function in MongoDB Query Language.

Aggregate Function in MongoDB Query Language (MQL)


The aggregate function in MongoDB is a powerful and flexible tool for performing data
transformation and computation operations. It allows you to process and summarize data in a
collection by applying multiple operations such as filtering, grouping, sorting, projecting, and
joining data. MongoDB's aggregation framework processes data through a series of stages,
each of which performs a specific operation on the data, forming an aggregation pipeline.
Key Features of the Aggregate Function in MongoDB:
1. Aggregation Pipeline: The aggregation framework in MongoDB works by using an
aggregation pipeline. The pipeline is a series of stages that process documents in a
sequence. The output of one stage becomes the input for the next.
o Each stage is represented by a MongoDB aggregation operator.
o The aggregation framework processes data efficiently and can handle complex
queries involving filtering, grouping, sorting, and more.
2. Stages in Aggregation Pipeline: Each stage in the aggregation pipeline performs a
specific operation on the data. Common stages include:
o $match: Filters documents to pass only those that match a specified condition
(similar to the WHERE clause in SQL).
o $group: Groups documents by a specified field and applies aggregate functions
like sum, avg, count, etc. (similar to GROUP BY in SQL).
o $sort: Sorts the documents by specified fields.
o $project: Reshapes the document, allowing you to include or exclude specific
fields.
o $limit: Limits the number of documents passed to the next stage.
o $skip: Skips a specified number of documents.
o $unwind: Deconstructs an array field and creates a document for each element.
o $lookup: Performs a left outer join with another collection (similar to SQL joins).

Syntax of the Aggregate Function:


db.collection.aggregate([
{ $stage1: { ... } },
{ $stage2: { ... } },
...
]);
Each stage is enclosed in a {} and is separated by commas.

Commonly Used Pipeline Stages:


1. $match: The $match stage filters the documents in the collection based on the specified
criteria. It is similar to the WHERE clause in SQL.
Example:
db.sales.aggregate([
{ $match: { region: "North" } }
]);
o This query filters the sales records to include only those where the region is
"North".
2. $group: The $group stage is used to group documents by a specific field and apply
aggregate functions such as sum, avg, count, etc.
Example:
db.sales.aggregate([
{ $group: { _id: "$region", total_sales: { $sum: "$amount" } } }
]);
o This groups the sales documents by region and calculates the total sales ($sum)
for each region.
3. $sort: The $sort stage sorts the documents by specified fields.
Example:
db.sales.aggregate([
{ $sort: { amount: -1 } }
]);
o This sorts the documents by the amount field in descending order.
4. $project: The $project stage reshapes each document in the pipeline by specifying
which fields to include or exclude.
Example:
db.sales.aggregate([
{ $project: { region: 1, amount: 1, _id: 0 } }
]);
o This projects the region and amount fields, while excluding the _id field from the
result.
5. $limit: The $limit stage limits the number of documents passed to the next stage of the
pipeline.
Example:
db.sales.aggregate([
{ $limit: 5 }
]);
o This limits the result to the first 5 documents.
6. $unwind: The $unwind stage deconstructs an array field and creates a separate
document for each element in the array.
Example:
db.orders.aggregate([
{ $unwind: "$items" }
]);
o This unwinds the items array field, creating a new document for each item in the
array.
7. $lookup: The $lookup stage performs a left outer join with another collection.
Example:
db.orders.aggregate([
{ $lookup: {
from: "products",
localField: "product_id",
foreignField: "_id",
as: "product_info"
}}
]);
o This performs a left outer join between the orders collection and the products
collection, matching product_id from orders with _id in products, and stores the
joined data in a field called product_info.

Example of a Complete Aggregation Pipeline:


Below is an example of a full aggregation pipeline that combines multiple stages.
Scenario: You want to get the top 3 regions with the highest total sales.
Pipeline:
db.sales.aggregate([
{ $match: { region: { $in: ["North", "South", "East", "West"] } } }, // Match specific regions
{ $group: { _id: "$region", total_sales: { $sum: "$amount" } } }, // Group by region and calculate
total sales
{ $sort: { total_sales: -1 } }, // Sort by total sales in descending order
{ $limit: 3 } // Limit the result to top 3 regions
]);

4. Write a Program in MongoDB using the aggregate function to calculate the total sales revenue for each product category in a collection.

Program in MongoDB Using the Aggregate Function to Calculate Total Sales Revenue for
Each Product Category
In this example, we will calculate the total sales revenue for each product category in a
MongoDB collection named sales. The sales collection contains documents with information
about the products sold, their prices, quantities, and categories.
Collection Structure (sales):

{
"_id": ObjectId("..."),
"product_name": "Laptop",
"category": "Electronics",
"price": 1000,
"quantity": 10
}

Steps to Calculate Total Sales Revenue:


• Step 1: Use the $group stage to group the documents by category.
• Step 2: Use the $sum operator to calculate the total revenue for each category by
multiplying the price and quantity fields.
• Step 3: Optionally, use the $sort stage to order the categories by the total sales revenue
in descending order.
MongoDB Aggregation Pipeline Program:
db.sales.aggregate([
  {
    // Step 1: Group by category
    $group: {
      _id: "$category", // Group by 'category'
      total_revenue: {
        $sum: { $multiply: ["$price", "$quantity"] } // Multiply price and quantity for total sales revenue
      }
    }
  },
  {
    // Step 2: Sort by total_revenue in descending order
    $sort: { total_revenue: -1 }
  }
]);
Explanation:
1. $group Stage:
o _id: "$category": Groups the documents by the category field.
o total_revenue: { $sum: { $multiply: ["$price", "$quantity"] } }: Calculates the
total revenue for each category by multiplying the price and quantity fields and
summing the results for each category.
2. $sort Stage:
o $sort: { total_revenue: -1 }: Sorts the categories by the calculated total revenue
in descending order (i.e., highest revenue first).
Example Output:
The output will be a list of product categories with their corresponding total sales revenue:
[
{
"_id": "Electronics",
"total_revenue": 15000
},
{
"_id": "Clothing",
"total_revenue": 8000
},
{
"_id": "Furniture",
"total_revenue": 4000
}
]

5. How does sorting and searching work in the MapReduce framework? Explain with an example.

Sorting and Searching in the MapReduce Framework (8 Marks)


Sorting and searching are key operations in the MapReduce framework, used to organize
data and efficiently find specific results during data processing.
Sorting in MapReduce:
1. Map Phase:
o The Mapper outputs key-value pairs, but sorting is not performed here.
o Sorting occurs later in the pipeline.
2. Shuffle and Sort Phase:
o The MapReduce framework automatically sorts the key-value pairs emitted by
the Mappers by key.
o All values for a specific key are grouped together and sent to the same Reducer.
3. Reduce Phase:
o The Reducer receives sorted key-value pairs and performs operations such as
summing or filtering. Sorting ensures that the data is processed in a predictable
order.
Searching in MapReduce:
1. Map Phase:
o Searching/filtering can be done by emitting only specific key-value pairs based
on a condition (e.g., specific words or ranges).
2. Reduce Phase:
o In the Reduce phase, complex searching tasks like top N or aggregation can be
performed after sorting the data.
Example: Finding the Top N Frequent Words:
1. Map Phase: The Mapper emits key-value pairs where the key is the word and the value is 1.
Example output:
("apple", 1)
("orange", 1)
("apple", 1)
("banana", 1)
("apple", 1)
("orange", 1)
2. Shuffle and Sort: The framework automatically groups the values by key and sorts the keys (words).
Grouped and sorted output:
("apple", [1, 1, 1])
("banana", [1])
("orange", [1, 1])
3. Reduce Phase: The Reducer sums the values for each key (word), giving the total count.
Output:
("apple", 3)
("banana", 1)
("orange", 2)
4. Final Sorting: Sorting the results gives the top N most frequent words.

6. Write a MapReduce program to count the number of occurrences of each word in a given dataset.

MapReduce Program to Count the Number of Occurrences of Each Word


In this program, we'll write a MapReduce job to count the occurrences of each word in a given
dataset (e.g., a text file). The MapReduce framework will break the task into two phases: the
Map phase and the Reduce phase.
Steps to Implement the Word Count Program:
1. Map Phase:
o The Mapper will read each line of the input text file, split it into words, and emit
each word with a count of 1.
2. Reduce Phase:
o The Reducer will take all the values (counts) for each word and sum them up to
compute the total occurrences of that word in the dataset.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCount {

// Mapper class
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

private final static IntWritable one = new IntWritable(1);


private Text word = new Text();
@Override
public void map(Object key, Text value, Context context) throws IOException,
InterruptedException {
// Split the input line into words
String[] tokens = value.toString().split("\\s+");

for (String token : tokens) {
word.set(token); // Set the word (Text field) as the key
context.write(word, one); // Emit the word with a count of 1
}
}
}

// Reducer class
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

private IntWritable result = new IntWritable();

@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws
IOException, InterruptedException {
int sum = 0;

// Sum up all the occurrences of the word


for (IntWritable val : values) {
sum += val.get();
}

result.set(sum); // Set the total count for the word


context.write(key, result); // Emit the word and its total count
}
}

// Main method to set up the job


public static void main(String[] args) throws Exception {
// Configure the job
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);

// Set Mapper and Reducer classes


job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class); // Optional, to reduce intermediate data
job.setReducerClass(IntSumReducer.class);
// Set output key and value types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

// Set input and output file paths


FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

// Submit the job and wait for completion


System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

7. Discuss the Hive architecture with a neat diagram

Hive Architecture Overview


Hive is a data warehousing and SQL-like query language that is built on top of Hadoop. It
provides an interface for querying and managing large datasets residing in Hadoop Distributed
File System (HDFS). The architecture of Hive is designed to allow users to query data stored in
Hadoop using a simplified, SQL-like syntax (HiveQL). It abstracts the complexities of writing
MapReduce programs directly and provides an easy-to-use platform for data analysts.
Components of Hive Architecture:
1. Hive Client:
o This is the interface through which users interact with Hive. Users can submit
queries using Hive CLI, JDBC, ODBC, or Web UI (like Beeline or Hive Web UI).
2. Hive Driver:
o The Driver component receives the queries submitted by the users and
manages the lifecycle of query execution.
o It parses the query, compiles it, optimizes it, and finally executes it by generating
the appropriate MapReduce jobs (or other execution plans).
3. Compiler:
o The Compiler translates the HiveQL queries into a series of MapReduce jobs
that can be run on Hadoop.
o It performs tasks such as syntax checking, semantic analysis, query
optimization, and query plan generation.
4. Execution Engine:
o The Execution Engine is responsible for executing the MapReduce jobs
generated by the compiler.
o It interfaces with Hadoop's MapReduce framework to execute the tasks in
parallel across the Hadoop cluster.
o It can also execute Tez or Spark jobs (in newer versions of Hive that support
execution engines beyond MapReduce).
5. MetaStore:
o MetaStore is a central repository where all metadata about the tables, partitions,
and schemas are stored.
o It contains information such as the table schema, location of data in HDFS,
and other metadata.
o The MetaStore can be stored in a relational database like MySQL, PostgreSQL,
etc.
6. Hive SerDe (Serializer/Deserializer):
o SerDe is used to convert the data from its original format into a format that Hive
can work with and vice versa.
o It supports various file formats like Text, Avro, Parquet, ORC, etc.
7. HDFS:
o HDFS (Hadoop Distributed File System) is where the actual data resides. Hive
stores large amounts of structured and unstructured data in HDFS, and the
Execution Engine interacts with this data during query execution.
8. Hadoop:
o Hive utilizes Hadoop MapReduce for query execution (though it can also use
Tez or Spark as an execution engine). MapReduce jobs process the data in
HDFS and return the results back to the Hive system.
8. Write an R program by creating a function to calculate factorial of a number using an iterative approach.

R Program to Calculate Factorial of a Number Using Iterative Approach


In this program, we will create a function in R that calculates the factorial of a number using an
iterative approach. The factorial of a number n is the product of all positive integers less than
or equal to n. The factorial is represented as n! and calculated as:
• n! = n * (n-1) * (n-2) * ... * 1

R Program Code:
# Function to calculate factorial of a number using an iterative approach
factorial_iterative <- function(n) {
  # Initialize the result variable to 1
  result <- 1

  # Loop to calculate factorial (seq_len(0) is empty, so factorial(0) correctly stays 1)
  for (i in seq_len(n)) {
    result <- result * i
  }

  # Return the calculated factorial
  return(result)
}

# Example usage of the function


number <- as.integer(readline(prompt = "Enter a number to calculate its factorial: "))

# Check if the number is non-negative


if (number < 0) {
print("Factorial is not defined for negative numbers")
} else {
# Call the function and display the result
fact_result <- factorial_iterative(number)
cat("The factorial of", number, "is", fact_result, "\n")
}
9. Describe the CRUD operations in MongoDB with an example.

CRUD Operations in MongoDB


CRUD stands for Create, Read, Update, and Delete, which are the four basic operations you
can perform on a MongoDB database. MongoDB, being a NoSQL database, provides a set of
methods to perform these operations efficiently.
Below is a description of each CRUD operation in MongoDB with examples:
1. Create Operation
The Create operation is used to insert documents into a collection.
• Method: insertOne() and insertMany()
• Example: Inserting a single document into a collection

// Connect to the MongoDB database


db = connect('mongodb://localhost:27017/testdb');

// Inserting a single document into the "employees" collection


db.employees.insertOne({
name: "John Doe",
age: 30,
position: "Software Engineer",
department: "IT"
});

// Inserting multiple documents into the "employees" collection


db.employees.insertMany([
{
name: "Alice Smith",
age: 25,
position: "Data Analyst",
department: "Data Science"
},
{
name: "Bob Johnson",
age: 35,
position: "Project Manager",
department: "IT"
}
]);
• Explanation:
o insertOne() is used to insert a single document.
o insertMany() is used to insert multiple documents at once.
2. Read Operation
The Read operation is used to retrieve documents from a collection.
• Method: find() and findOne()
• Example: Finding documents in the "employees" collection

// Find all employees in the "employees" collection


db.employees.find();

// Find employees with a specific condition


db.employees.find({ department: "IT" });

// Find a single employee by name


db.employees.findOne({ name: "John Doe" });
• Explanation:
o find() returns all documents that match the specified query.
o findOne() returns a single document that matches the query condition.
3. Update Operation
The Update operation is used to modify an existing document in a collection.
• Method: updateOne(), updateMany(), and replaceOne()
• Example: Updating documents in the "employees" collection

// Update a single employee's position in the "employees" collection


db.employees.updateOne(
{ name: "John Doe" }, // Filter condition
{ $set: { position: "Senior Software Engineer" } } // Update operation
);

// Update multiple employees' department to "Engineering"


db.employees.updateMany(
{ department: "IT" },
{ $set: { department: "Engineering" } }
);

// Replace a document with new data


db.employees.replaceOne(
{ name: "Alice Smith" },
{ name: "Alice Brown", age: 26, position: "Senior Data Analyst", department: "Data Science" }
);
• Explanation:
o updateOne() updates the first document that matches the query.
o updateMany() updates all documents that match the query.
o replaceOne() replaces an entire document with a new one.
4. Delete Operation
The Delete operation is used to remove documents from a collection.
• Method: deleteOne() and deleteMany()
• Example: Deleting documents from the "employees" collection

// Delete a single employee by name


db.employees.deleteOne({ name: "John Doe" });

// Delete multiple employees from the "employees" collection


db.employees.deleteMany({ department: "Engineering" });
• Explanation:
o deleteOne() deletes the first document that matches the query.
o deleteMany() deletes all documents that match the query condition.

10. Write an R program to perform data visualization using the ggplot2 library.

R Program to Perform Data Visualization Using ggplot2


In this program, we will use the ggplot2 library to visualize data in R. ggplot2 is one of the most
popular libraries for data visualization and is part of the tidyverse package. It allows users to
create elegant and customizable plots.
Steps:
1. Install and Load ggplot2: First, we need to install the ggplot2 package (if not already
installed) and then load it.
2. Create Sample Data: We will create a sample data frame for visualization.
3. Create Plots: We will create different types of plots such as a scatter plot, bar plot, and
histogram.
R Program Code:
# Install ggplot2 if it's not installed already
# install.packages("ggplot2")

# Load ggplot2 library


library(ggplot2)

# Step 1: Create a Sample Data Frame


data <- data.frame(
Category = c('A', 'B', 'C', 'D', 'E'),
Value = c(23, 45, 56, 78, 89),
Age = c(25, 30, 35, 40, 45)
)

# Step 2: Create a Bar Plot to visualize the 'Value' for each 'Category'
ggplot(data, aes(x = Category, y = Value, fill = Category)) +
geom_bar(stat = "identity") +
ggtitle("Bar Plot of Category vs Value") +
xlab("Category") +
ylab("Value") +
theme_minimal()

# Step 3: Create a Scatter Plot to visualize 'Age' vs 'Value'


ggplot(data, aes(x = Age, y = Value)) +
geom_point(color = "blue", size = 3) +
ggtitle("Scatter Plot of Age vs Value") +
xlab("Age") +
ylab("Value") +
theme_minimal()

# Step 4: Create a Histogram for the 'Value' column to show distribution


ggplot(data, aes(x = Value)) +
geom_histogram(bins = 5, fill = "skyblue", color = "black", alpha = 0.7) +
ggtitle("Histogram of Value") +
xlab("Value") +
ylab("Frequency") +
theme_minimal()

11. Generalize HIVE commands with an example.

General Hive Commands with Examples


Hive is a data warehouse system built on top of Hadoop, which provides an SQL-like query
language known as HiveQL to query and manage large datasets. Hive is commonly used to
work with large-scale structured data in a Hadoop ecosystem. Below are some common Hive
commands along with examples.
1. Creating Databases
Hive allows you to create databases for organizing tables.
• Command: CREATE DATABASE
Syntax:
CREATE DATABASE database_name;
Example:
CREATE DATABASE employee_db;

2. Creating a Table
In Hive, you can create a table with a defined schema to store data.
• Command: CREATE TABLE
Syntax:
CREATE TABLE table_name (
column1 datatype,
column2 datatype,
...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY 'delimiter';
Example:
CREATE TABLE employee_details (
emp_id INT,
emp_name STRING,
emp_age INT,
emp_salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

3. Loading Data into a Table


After creating a table, you can load data from a file into the table.
• Command: LOAD DATA
Syntax:
LOAD DATA INPATH 'hdfs_path' INTO TABLE table_name;
Example:
LOAD DATA INPATH '/user/hadoop/employee_data.csv' INTO TABLE employee_details;

4. Querying Data (SELECT)


Hive supports SQL-like queries to retrieve data from tables.
• Command: SELECT
Syntax:
SELECT column1, column2, ... FROM table_name WHERE condition;
Example:
SELECT emp_name, emp_salary FROM employee_details WHERE emp_age > 30;

5. Dropping a Table
To remove a table from Hive, use the DROP TABLE command.
• Command: DROP TABLE
Syntax:
DROP TABLE table_name;
Example:
DROP TABLE employee_details;

6. Altering a Table
You can alter a table's schema by adding, modifying, or dropping columns.
• Command: ALTER TABLE
Syntax:
ALTER TABLE table_name ADD COLUMNS (column_name datatype);
Example:
ALTER TABLE employee_details ADD COLUMNS (emp_department STRING);

7. Dropping a Database
To remove a database from Hive, use the DROP DATABASE command. The database must be
empty to drop it.
• Command: DROP DATABASE
Syntax:
DROP DATABASE database_name;
Example:
DROP DATABASE employee_db;

8. Listing Tables
You can list all tables in the current database.
• Command: SHOW TABLES
Syntax:
SHOW TABLES;
Example:
SHOW TABLES;

9. Describing a Table
To view the schema of a table, use the DESCRIBE command.
• Command: DESCRIBE
Syntax:
DESCRIBE table_name;
Example:
DESCRIBE employee_details;

10. Inserting Data into a Table


You can insert data into a table using the INSERT INTO command.
• Command: INSERT INTO
Syntax:
INSERT INTO TABLE table_name VALUES (value1, value2, ...);
Example:
INSERT INTO TABLE employee_details VALUES (1, 'John Doe', 28, 50000.0);
12. Compare the functionalities and use cases of MongoDB and traditional
relational databases.

13. Elaborate the Mapper and Reducer task with a neat sketch.
Elaboration of Mapper and Reducer Task in MapReduce with Diagram
MapReduce is a distributed data processing framework used in Hadoop to handle large-scale
data. It consists of two key phases:
1. Mapper Phase – Processes input data and converts it into key-value pairs.
2. Reducer Phase – Aggregates and processes intermediate key-value pairs to generate
the final output.

1. Mapper Task
• The Mapper takes input data, processes it, and emits key-value pairs as intermediate
output.
• It runs in parallel on multiple nodes to increase efficiency.
Example: Word Count in a Document
• Input: A text file containing sentences.
• The Mapper reads the file and splits it into words.
• It emits each word with a count of 1.
Input File Content:

Hello Hadoop
Hello Big Data
Mapper Output (Key-Value Pairs):

(Hello, 1)
(Hadoop, 1)
(Hello, 1)
(Big, 1)
(Data, 1)

2. Reducer Task
• The Reducer takes the output from the Mapper, aggregates the values based on keys,
and produces the final result.
• It processes data after shuffling & sorting, where keys with the same values are
grouped together.
Reducer Input (After Shuffling & Sorting):

(Big, [1])
(Data, [1])
(Hadoop, [1])
(Hello, [1,1])
Reducer Output (Final Word Count):

(Big, 1)
(Data, 1)
(Hadoop, 1)
(Hello, 2)

14. Write a Map reduce program to sort data by student name.

MapReduce Program to Sort Data by Student Name


In this MapReduce program, we will sort student names in ascending order. The Mapper will
read the student data and emit the name as the key and other details as the value. The
Reducer will collect the sorted names and output the final sorted list.

1. Input Data (students.txt)


Assume we have a file with student records in the format:
102, Alice, 85
101, Bob, 90
103, Charlie, 78
104, David, 88
(Format: StudentID, Name, Marks)

2. MapReduce Implementation
Mapper Class
• Reads input records.
• Emits student name as the key and the rest of the record as the value.
Reducer Class
• Since keys are automatically sorted by Hadoop during the shuffle phase, the Reducer
simply outputs them in sorted order.
MapReduce Java Program for Sorting by Name

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StudentSort {

// Mapper Class
public static class NameMapper extends Mapper<Object, Text, Text, Text> {
public void map(Object key, Text value, Context context) throws IOException,
InterruptedException {
String[] fields = value.toString().split(","); // Split by comma
if (fields.length == 3) {
String studentName = fields[1].trim(); // Name as key
String studentData = fields[0] + "," + fields[2]; // StudentID, Marks as value
context.write(new Text(studentName), new Text(studentData));
}
}
}

// Reducer Class
public static class NameReducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException,
InterruptedException {
for (Text value : values) {
context.write(key, value); // Output sorted by key (student name)
}
}
}

// Driver Code
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Sort Students by Name");
job.setJarByClass(StudentSort.class);
job.setMapperClass(NameMapper.class);
job.setReducerClass(NameReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

15. Differentiate between the various file compression techniques in MapReduce and their impact on performance.

Impact on MapReduce Performance


1. Splittability:
o Splittable formats (Bzip2, LZO, Zstandard) allow parallel processing across
multiple nodes, improving efficiency.
o Non-splittable formats (Gzip, Snappy) require a single node to process the entire
file, reducing parallelism.
2. Compression Ratio vs. CPU Overhead:
o Higher compression ratios (Bzip2, Gzip) save disk space but require more
CPU power for decompression.
o Lower compression ratios (LZO, Snappy) are optimized for speed, reducing
CPU overhead.
3. Best Practices for MapReduce Jobs:
o Use Bzip2 or LZO for input files to enable splitting and parallel processing.
o Use Gzip or Zstandard for output files if storage efficiency is a priority.
o Use Snappy for applications where real-time speed is more important than
compression ratio (a driver-side configuration sketch follows).
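A brief driver-side sketch of how such choices are typically wired into a Hadoop job (the specific codec choices here are illustrative, not prescribed by the question):

import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = new Configuration();
// Compress intermediate map output with Snappy to cut shuffle traffic
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);

Job job = Job.getInstance(conf, "compressed job");
// Compress the final job output with Gzip to save HDFS space
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);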

16. Write a Map reduce Program to search a specific keyword in a file.

MapReduce Program to Search for a Specific Keyword in a File


In this MapReduce program, we will search for a specific keyword in a text file and output the
lines that contain the keyword.

1. Input File (input.txt)


Example file content:
Hadoop is a distributed computing framework.
MapReduce is a programming model for big data.
Hadoop and Spark are used for large-scale data processing.
Data analytics is an important field in big data.
If the search keyword is "Hadoop", the program should return lines that contain the word
"Hadoop".

2. MapReduce Java Program for Keyword Search


import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class KeywordSearch {

public static class KeywordMapper extends Mapper<Object, Text, Text, Text> {


private String searchKeyword;

@Override
protected void setup(Context context) {
Configuration conf = context.getConfiguration();
searchKeyword = conf.get("keyword"); // Get keyword from configuration
}

public void map(Object key, Text value, Context context) throws IOException,
InterruptedException {
String line = value.toString();
if (line.contains(searchKeyword)) { // Check if line contains the keyword
context.write(new Text("Matching Line:"), new Text(line));
}
}
}

public static class KeywordReducer extends Reducer<Text, Text, Text, Text> {


public void reduce(Text key, Iterable<Text> values, Context context) throws IOException,
InterruptedException {
for (Text value : values) {
context.write(key, value); // Output matching lines
}
}
}

public static void main(String[] args) throws Exception {


Configuration conf = new Configuration();
conf.set("keyword", args[2]); // Set the keyword to search

Job job = Job.getInstance(conf, "Keyword Search");


job.setJarByClass(KeywordSearch.class);
job.setMapperClass(KeywordMapper.class);
job.setReducerClass(KeywordReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
17. Summarize the Architecture of HIVE in detail.

Hive Architecture Overview


Hive is a data warehousing and SQL-like query language that is built on top of Hadoop. It
provides an interface for querying and managing large datasets residing in Hadoop Distributed
File System (HDFS). The architecture of Hive is designed to allow users to query data stored in
Hadoop using a simplified, SQL-like syntax (HiveQL). It abstracts the complexities of writing
MapReduce programs directly and provides an easy-to-use platform for data analysts.
Components of Hive Architecture:
1. Hive Client:
o This is the interface through which users interact with Hive. Users can submit
queries using Hive CLI, JDBC, ODBC, or Web UI (like Beeline or Hive Web UI).
2. Hive Driver:
o The Driver component receives the queries submitted by the users and
manages the lifecycle of query execution.
o It parses the query, compiles it, optimizes it, and finally executes it by generating
the appropriate MapReduce jobs (or other execution plans).
3. Compiler:
o The Compiler translates the HiveQL queries into a series of MapReduce jobs
that can be run on Hadoop.
o It performs tasks such as syntax checking, semantic analysis, query
optimization, and query plan generation.
4. Execution Engine:
o The Execution Engine is responsible for executing the MapReduce jobs
generated by the compiler.
o It interfaces with Hadoop's MapReduce framework to execute the tasks in
parallel across the Hadoop cluster.
o It can also execute Tez or Spark jobs (in newer versions of Hive that support
execution engines beyond MapReduce).
5. MetaStore:
o MetaStore is a central repository where all metadata about the tables, partitions,
and schemas are stored.
o It contains information such as the table schema, location of data in HDFS,
and other metadata.
o The MetaStore can be stored in a relational database like MySQL, PostgreSQL,
etc.
6. Hive SerDe (Serializer/Deserializer):
o SerDe is used to convert the data from its original format into a format that Hive
can work with and vice versa.
o It supports various file formats like Text, Avro, Parquet, ORC, etc.
7. HDFS:
o HDFS (Hadoop Distributed File System) is where the actual data resides. Hive
stores large amounts of structured and unstructured data in HDFS, and the
Execution Engine interacts with this data during query execution.
8. Hadoop:
o Hive utilizes Hadoop MapReduce for query execution (though it can also use
Tez or Spark as an execution engine). MapReduce jobs process the data in
HDFS and return the results back to the Hive system.

18. Write an R program by creating a function to calculate factorial of a number using an iterative approach.

R Program to Calculate Factorial of a Number Using Iterative Approach


In this program, we will create a function in R that calculates the factorial of a number using an
iterative approach. The factorial of a number n is the product of all positive integers less than
or equal to n. The factorial is represented as n! and calculated as:
• n! = n * (n-1) * (n-2) * ... * 1

R Program Code:
# Function to calculate factorial of a number using an iterative approach
factorial_iterative <- function(n) {
  # Initialize the result variable to 1
  result <- 1

  # Loop to calculate factorial (seq_len(0) is empty, so factorial(0) correctly stays 1)
  for (i in seq_len(n)) {
    result <- result * i
  }

  # Return the calculated factorial
  return(result)
}

# Example usage of the function


number <- as.integer(readline(prompt = "Enter a number to calculate its factorial: "))

# Check if the number is non-negative


if (number < 0) {
print("Factorial is not defined for negative numbers")
} else {
# Call the function and display the result
fact_result <- factorial_iterative(number)
cat("The factorial of", number, "is", fact_result, "\n")
}

19. Evaluate the advantages and limitations of integrating MapReduce with R in data analytics.

Advantages and Limitations of Integrating MapReduce with R in Data Analytics


MapReduce and R can be integrated using RHadoop, RHIPE, or SparkR for large-scale data
analytics. This integration combines the distributed processing power of MapReduce with R's
statistical and machine-learning capabilities.
Advantages:
1. Scalability: R alone is limited by a single machine's memory; running R logic inside
MapReduce jobs lets analyses scale to datasets stored across an HDFS cluster.
2. Rich analytics: R's statistical, modelling, and visualization functions can be applied to
big data without rewriting them in Java.
Limitations:
1. Performance overhead: data must be serialized and passed between Hadoop and the R
processes, so jobs run slower than equivalent native Java MapReduce code.
2. Complexity: R and the integration packages must be installed and configured on every
cluster node, and debugging distributed R code is harder than working in standalone R.
20. Describe the CRUD operations in MongoDB with an example.

CRUD Operations in MongoDB


CRUD stands for Create, Read, Update, and Delete, which are the four basic operations you
can perform on a MongoDB database. MongoDB, being a NoSQL database, provides a set of
methods to perform these operations efficiently.
Below is a description of each CRUD operation in MongoDB with examples:
1. Create Operation
The Create operation is used to insert documents into a collection.
• Method: insertOne() and insertMany()
• Example: Inserting a single document into a collection

// Connect to the MongoDB database


db = connect('mongodb://localhost:27017/testdb');
// Inserting a single document into the "employees" collection
db.employees.insertOne({
name: "John Doe",
age: 30,
position: "Software Engineer",
department: "IT"
});

// Inserting multiple documents into the "employees" collection


db.employees.insertMany([
{
name: "Alice Smith",
age: 25,
position: "Data Analyst",
department: "Data Science"
},
{
name: "Bob Johnson",
age: 35,
position: "Project Manager",
department: "IT"
}
]);
• Explanation:
o insertOne() is used to insert a single document.
o insertMany() is used to insert multiple documents at once.
2. Read Operation
The Read operation is used to retrieve documents from a collection.
• Method: find() and findOne()
• Example: Finding documents in the "employees" collection

// Find all employees in the "employees" collection


db.employees.find();

// Find employees with a specific condition


db.employees.find({ department: "IT" });

// Find a single employee by name


db.employees.findOne({ name: "John Doe" });
• Explanation:
o find() returns all documents that match the specified query.
o findOne() returns a single document that matches the query condition.
3. Update Operation
The Update operation is used to modify an existing document in a collection.
• Method: updateOne(), updateMany(), and replaceOne()
• Example: Updating documents in the "employees" collection

// Update a single employee's position in the "employees" collection


db.employees.updateOne(
{ name: "John Doe" }, // Filter condition
{ $set: { position: "Senior Software Engineer" } } // Update operation
);

// Update multiple employees' department to "Engineering"


db.employees.updateMany(
{ department: "IT" },
{ $set: { department: "Engineering" } }
);

// Replace a document with new data


db.employees.replaceOne(
{ name: "Alice Smith" },
{ name: "Alice Brown", age: 26, position: "Senior Data Analyst", department: "Data Science" }
);
• Explanation:
o updateOne() updates the first document that matches the query.
o updateMany() updates all documents that match the query.
o replaceOne() replaces an entire document with a new one.
4. Delete Operation
The Delete operation is used to remove documents from a collection.
• Method: deleteOne() and deleteMany()
• Example: Deleting documents from the "employees" collection

// Delete a single employee by name


db.employees.deleteOne({ name: "John Doe" });

// Delete multiple employees from the "employees" collection


db.employees.deleteMany({ department: "Engineering" });
• Explanation:
o deleteOne() deletes the first document that matches the query.
o deleteMany() deletes all documents that match the query condition.
