Stream Processing Lab Manual
Install MongoDB
Aim:
To install MongoDB and verify that it is running. MongoDB is a NoSQL database that stores data in a flexible, JSON-like format. Below are the steps to install MongoDB on Ubuntu, macOS, and Windows.
Procedure (Ubuntu):
1. Update the package index:
sudo apt update
2. Install the required dependencies:
sudo apt install -y libcurl4 openssl liblzma5
3. Import the MongoDB public GPG key:
wget -qO - https://www.mongodb.org/static/pgp/server-5.0.asc | sudo apt-key add -
4. Add the MongoDB repository:
echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/5.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-5.0.list
5. Update the package index again and install MongoDB:
sudo apt update
sudo apt install -y mongodb-org
6. Start the MongoDB service:
sudo systemctl start mongod
7. Enable MongoDB to start on boot:
sudo systemctl enable mongod
8. Check that the service is running:
sudo systemctl status mongod
9. Open the MongoDB shell:
mongo
Procedure (macOS, using Homebrew):
1. Install Homebrew if it is not already installed:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
2. Add the MongoDB Homebrew tap:
brew tap mongodb/brew
3. Install MongoDB Community Edition 5.0:
brew install mongodb-community@5.0
4. Start MongoDB as a background service:
brew services start mongodb/brew/mongodb-community
5. Verify that the service is running:
brew services list
6. Open the MongoDB shell:
mongo
Procedure (Windows):
1. Download the MongoDB MSI installer from the official MongoDB download page and run it. During installation, make sure the option to add MongoDB to your system's PATH environment variable is selected.
2. MongoDB should start automatically as a Windows service. If not, start it manually from an elevated Command Prompt:
net start MongoDB
3. Open the MongoDB shell:
mongo
Verify the Installation:
To check if MongoDB is working properly, connect to the MongoDB shell. Type the following command in your terminal or command prompt:
mongo
This will start the MongoDB shell, and you'll see a prompt like this:
>
You can then run commands like show dbs to list the databases, or use test to switch to the test database.
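For example, a quick check in the shell (the demo collection name here is just an illustration):
show dbs
use test
db.demo.insertOne({ message: "hello" })   // create a sample document in a new collection
db.demo.find()                            // should return the document just inserted
exit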
Enable Remote Access (optional):
1. Edit the MongoDB configuration file (mongod.conf).
2. Set bindIp so MongoDB listens on all network interfaces:
net:
  bindIp: 0.0.0.0
3. Restart MongoDB:
Restart the MongoDB service to apply the changes.
On Linux/macOS:
sudo systemctl restart mongod
On Windows, restart the MongoDB service via the Services application or by running net stop MongoDB followed by net start MongoDB.
Enable Access Control (optional):
1. Create an Admin User: Open the MongoDB shell and run the following commands:
use admin
db.createUser({ user: "admin", pwd: "password", roles: [{ role: "userAdminAnyDatabase", db: "admin" }] })
2. Enable Authentication:
Edit the mongod.conf file and enable authorization:
security:
  authorization: "enabled"
3. Restart MongoDB:
Restart MongoDB to apply the changes.
4. Connect as the Admin User:
mongo -u admin -p password --authenticationDatabase admin
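When access control is enabled, applications must supply credentials as well. A minimal sketch for the Node.js driver used later in this manual, assuming the admin user created above (in practice you would create a dedicated, less-privileged user for the application):
const { MongoClient } = require('mongodb');
const uri = 'mongodb://admin:password@localhost:27017/?authSource=admin';
MongoClient.connect(uri)
  .then(client => {
    console.log('Authenticated connection established');
    return client.close();
  })
  .catch(err => console.error('Authentication failed:', err));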
Output:
Here's a simplified output you might see when installing MongoDB on Ubuntu:
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed: ...
To confirm the installed version:
mongo --version
MongoDB shell version v5.0.15
Result:
You've successfully installed MongoDB on your system. MongoDB is now ready for use in
storing, querying, and managing your NoSQL data. You can now integrate it into your
applications or use it directly for data storage in real-time applications like analytics, logging, or
content management.
Aim:
To build a simple CRUD (Create, Read, Update, Delete) application using MongoDB and Node.js. The app will perform basic operations on a MongoDB database to manage user data (name, email, and age).
Procedure:
Step 1: Set Up the Project
Create a project folder and initialize a Node.js project:
mkdir mongo-crud-app
cd mongo-crud-app
npm init -y
Install the required packages:
npm install express mongodb
Step 2: Create Application
Create a file app.js and set up the server and MongoDB connection.
const express = require('express');
const { MongoClient, ObjectId } = require('mongodb');

const app = express();
app.use(express.json());

// Connection settings (local MongoDB instance, database used by this app)
const url = 'mongodb://localhost:27017';
const dbName = 'crudApp';
let db;

// Connect to MongoDB
MongoClient.connect(url, { useNewUrlParser: true, useUnifiedTopology: true })
  .then(client => {
    console.log('Connected to MongoDB');
    db = client.db(dbName);
  })
  .catch(err => console.error('Failed to connect to MongoDB:', err));

// Routes
app.get('/', (req, res) => res.send('MongoDB CRUD Application'));

// Start server
const port = 3000;
app.listen(port, () => console.log(`Server is running on http://localhost:${port}`));
Step 3: Run the Application
node app.js
Step 4: Test the CRUD Endpoints
1. Create User (POST /users): Send a JSON body such as:
{ "name": "John Doe", "email": "john@example.com", "age": 30 }
2. Read Users (GET /users): Retrieve all users.
3. Update User (PUT /users/:id): Send the updated fields, for example:
{ "name": "John Smith", "email": "john.smith@example.com", "age": 31 }
4. Delete User (DELETE /users/:id): Delete a user by their unique _id.
A sketch of these route handlers is shown below.
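A minimal sketch of these handlers, assuming the documents live in a users collection of the crudApp database (add them to app.js before app.listen):
// Create a user
app.post('/users', async (req, res) => {
  const result = await db.collection('users').insertOne(req.body);
  res.status(201).json(result);
});
// Read all users
app.get('/users', async (req, res) => {
  res.json(await db.collection('users').find().toArray());
});
// Update a user by _id
app.put('/users/:id', async (req, res) => {
  const result = await db.collection('users').updateOne(
    { _id: new ObjectId(req.params.id) },
    { $set: req.body }
  );
  res.json(result);
});
// Delete a user by _id
app.delete('/users/:id', async (req, res) => {
  res.json(await db.collection('users').deleteOne({ _id: new ObjectId(req.params.id) }));
});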
Step 5: Verify the Data in the MongoDB Shell
mongo
use crudApp
db.users.find().pretty()
Output
> show dbs
admin   0.000GB
local   0.000GB
> use crudApp
switched to db crudApp
> db.createCollection("users")
{ "ok" : 1 }
> db.users.find()
{
  "_id" : ObjectId("650d6509341234567890abc"),
  "name" : "Alice",
  "age" : 30
}
Result:
This simple application demonstrates how to perform basic CRUD operations with MongoDB
and Node.js. You can extend this by adding validation, authentication, or advanced MongoDB
features like aggregation. This CRUD application can be further built upon to create complex
systems for real-world applications.
Aim:
Once you've set up your CRUD application using MongoDB and Node.js, you may want to
query the system to retrieve, update, or delete data based on specific conditions. In this section,
we'll explore various MongoDB query operations that you can perform on your database.
Procedure:
1. Query All Users
db.users.find().pretty()
This command returns all documents (users) in the users collection, formatted neatly for better readability.
2. Query a User by ID
db.users.find({ _id: ObjectId("user-id-here") }).pretty()
Replace "user-id-here" with the actual _id of the user you want to query. In MongoDB, the _id is typically an ObjectId, which you can generate or retrieve programmatically.
Example:
db.users.find({ _id: ObjectId("613b2a5b9e24f53c6a05689b") }).pretty()
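For instance, in a Node.js script using the mongodb driver installed earlier, an ObjectId can be generated or parsed like this (a small illustrative snippet):
const { ObjectId } = require('mongodb');
const generated = new ObjectId();                        // a brand-new id
const parsed = new ObjectId('613b2a5b9e24f53c6a05689b'); // parse an existing 24-character hex string
console.log(generated.toHexString(), parsed.toHexString());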
3. Query Users by Name
db.users.find({ name: "John Doe" }).pretty()
4. Query Users with Conditions
To retrieve users who meet specific conditions (e.g., age > 25):
db.users.find({ age: { $gt: 25 } }).pretty()
This returns all users whose age field is greater than 25. You can use other comparison operators such as $lt (less than), $gte (greater than or equal to), $lte (less than or equal to), and $ne (not equal to).
5. Query Users with Multiple Conditions
Example:
db.users.find({ age: { $gt: 25 }, name: "John" }).pretty()
This query returns all users who are older than 25 and whose name is John.
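Comparison operators can also be combined on a single field; for example, an age range query (the bounds are arbitrary):
db.users.find({ age: { $gte: 25, $lte: 40 } }).pretty()
This returns users whose age is between 25 and 40, inclusive.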
6. Query Users with a Regular Expression (e.g., Name Starts with "J")
If you want to find users whose name starts with "J", you can use a regular expression in
MongoDB.
db.users.find({ name: { $regex: "^J", $options: "i" } }).pretty()
This query uses the regular expression ^J to match names that start with "J". The $options:
"i" makes the search case-insensitive.
7. Sort Query Results
You can sort the query results by a field. For example, to get users sorted by age in ascending order:
db.users.find().sort({ age: 1 }).pretty()
To sort by age in descending order:
db.users.find().sort({ age: -1 }).pretty()
8. Limit the Number of Results
If you only want a specific number of users returned, you can limit the results using the limit()
method. For example, to retrieve the first 3 users:
db.users.find().limit(3).pretty()
This will return the first 3 users from the users collection.
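limit() is often paired with sort() and skip() for simple pagination; an illustrative example with a page size of 3:
db.users.find().sort({ age: 1 }).skip(3).limit(3).pretty()
This returns the second page of three users when the results are ordered by age.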
9. Aggregation Queries
MongoDB provides the aggregation framework to perform complex queries like grouping, filtering, and transforming data. For example, you can group users by their age:
db.users.aggregate([
  { $group: { _id: "$age", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])
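A pipeline can also filter documents before grouping; for example, to count users older than 25 by age (the threshold is arbitrary):
db.users.aggregate([
  { $match: { age: { $gt: 25 } } },
  { $group: { _id: "$age", count: { $sum: 1 } } }
])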
10. Update a User
To update a user's information (e.g., changing the name of a user with a specific _id):
db.users.updateOne(
  { _id: ObjectId("user-id-here") },
  { $set: { name: "Updated Name" } }
)
This command updates the name field of the user with the provided _id. Use $set to modify specific fields without affecting others.
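To apply the same change to every matching document, updateMany() takes the same arguments; an illustrative example that increments the age of all users named John:
db.users.updateMany(
  { name: "John" },
  { $inc: { age: 1 } }
)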
11. Delete a User
db.users.deleteOne({ _id: ObjectId("user-id-here") })
This command deletes the user with the specified _id. You can also use deleteMany() to delete multiple users that match certain conditions, as shown below.
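An illustrative deleteMany() call (the age threshold is arbitrary):
db.users.deleteMany({ age: { $lt: 18 } })
This removes every user whose age is below 18.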
Output
Query All Users:
[
{
"_id" : ObjectId("650d6509341234567890abc"),
"name" : "Alice",
"age" : 30
},
{
"_id" : ObjectId("650d6509341234567890def"),
"name" : "Bob",
"age" : 25
}
]
Query User by ID:
{
"_id" : ObjectId("650d6509341234567890abc"),
"name" : "Alice",
"age" : 30
}
Result
In this guide, we've demonstrated several MongoDB query techniques to interact with the users
collection. These include basic queries, conditional queries, sorting, limiting, and aggregation
queries. MongoDB's flexibility allows you to perform powerful and efficient data retrieval and
manipulation tasks using these techniques.
Objective:
Create an event stream using Apache Kafka by setting up a producer, consumer, and topic.
Procedure:
1. Download and Extract Kafka:
wget https://downloads.apache.org/kafka/3.4.0/kafka_2.13-3.4.0.tgz
tar -xvzf kafka_2.13-3.4.0.tgz
cd kafka_2.13-3.4.0
2. Start Zookeeper:
bin/zookeeper-server-start.sh config/zookeeper.properties
3. Start Kafka:
bin/kafka-server-start.sh config/server.properties
4. Produce and Consume Events:
Create the topic:
bin/kafka-topics.sh --create --topic event-stream --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
Option 1: Console Producer
bin/kafka-console-producer.sh --topic event-stream --bootstrap-server localhost:9092
Type events (e.g., Event 1: User logged in) and press Enter.
Option 2: Java Producer
EventProducer.java:
import org.apache.kafka.clients.producer.*;
import java.util.Properties;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Send a single event to the event-stream topic and close the producer
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.send(new ProducerRecord<>("event-stream", "key", "Event 1"));
        producer.close();
    }
}
Compile and run the producer (with the Kafka client JARs on the classpath):
javac EventProducer.java
java EventProducer
Consume events:
Option 1: Console Consumer
bin/kafka-console-consumer.sh --topic event-stream --bootstrap-server localhost:9092 --from-beginning
Option 2: Java Consumer
EventConsumer.java:
import org.apache.kafka.clients.consumer.*;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class EventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "group1");
        props.put("auto.offset.reset", "earliest"); // read the topic from the beginning for this demo
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("event-stream"));

        // Keep polling for new events and print their values
        while (true) {
            consumer.poll(Duration.ofSeconds(1)).forEach(record -> System.out.println(record.value()));
        }
    }
}
Compile and run the consumer:
javac EventConsumer.java
java EventConsumer
5. Clean Up:
Stop Kafka:
bin/kafka-server-stop.sh
Stop Zookeeper:
bin/zookeeper-server-stop.sh
Output
... # Download logs
kafka_2.13-3.4.0/
kafka_2.13-3.4.0/LICENSE
kafka_2.13-3.4.0/NOTICE
kafka_2.13-3.4.0/bin/
... # List of extracted files
Result:
You’ve successfully set up a Kafka event stream by creating a producer, consumer, and topic.
You’ve also learned how to send and consume events from Kafka using both the console and
Java APIs.
Objective:
Build a real-time stream processing application using Apache Spark Streaming and Kafka.
Procedure:
1. Kafka Setup:
Start Zookeeper:
bin/zookeeper-server-start.sh config/zookeeper.properties
Start Kafka:
bin/kafka-server-start.sh config/server.properties
Create a topic:
bin/kafka-topics.sh --create --topic stream-topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
2. Spark Application Setup:
Add the Maven dependencies to pom.xml:
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>3.4.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
    <version>3.4.0</version>
  </dependency>
</dependencies>
SparkStreamingKafka.java:
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.*;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SparkStreamingKafka {
    public static void main(String[] args) throws InterruptedException {
        // Spark configuration with a 5-second batch interval
        SparkConf conf = new SparkConf().setAppName("SparkStreamingKafka");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Kafka consumer parameters
        String topic = "stream-topic";
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("group.id", "spark-streaming-group");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);

        // Create a direct stream from the Kafka topic
        JavaInputDStream<ConsumerRecord<String, String>> kafkaStream =
            KafkaUtils.createDirectStream(
                ssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Collections.singleton(topic), kafkaParams)
            );

        // Print every message in each micro-batch
        kafkaStream.foreachRDD(rdd -> {
            rdd.foreach(record -> System.out.println("Message: " + record.value()));
        });

        ssc.start();
        ssc.awaitTermination();
    }
}
3. Build and Run:
Build the application:
mvn clean package
Submit it to Spark:
./bin/spark-submit --class SparkStreamingKafka --master local[2] target/your-app-jar-file.jar
4. Send Test Messages:
Start a console producer and type a few messages:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic stream-topic
Example messages:
Hello, Spark Streaming!
Event: 1
Event: 2
5. Monitor Output:
After running the Spark app, you should see the messages in the console:
Message: Hello, Spark Streaming!
Message: Event: 1
Message: Event: 2
6. Add Transformations (optional):
For example, to keep only messages that contain "Event":
kafkaStream
    .map(record -> record.value())
    .filter(message -> message.contains("Event"))
    .foreachRDD(rdd -> {
        rdd.foreach(message -> System.out.println("Filtered: " + message));
    });
7. Clean Up:
1. Stop Kafka:
bin/kafka-server-stop.sh
2. Stop Zookeeper:
bin/zookeeper-server-stop.sh
Output
1. Kafka & Spark Setup:
- Start Kafka & Zookeeper:
[INFO] Starting Kafka & Zookeeper Server
[INFO] Kafka & Zookeeper Server started on port 9092
7. Clean Up:
- Stop Kafka & Zookeeper:
[INFO] Stopping Kafka & Zookeeper Server
[INFO] Kafka & Zookeeper Server stopped.
Result:
You have successfully built a real-time stream processing application using Apache Spark
Streaming and Kafka. This setup reads messages from a Kafka topic, processes them in real
time, and can be expanded to include more complex transformations or output to various sinks.
6. Build a Micro-batch application
Objective:
Create a micro-batch application using Apache Spark Streaming that processes incoming
data in small batches (windows), performs transformations, and outputs results.
In Spark Streaming, data is processed in small batches at a configurable time interval (e.g., 1 second, 5
seconds). These small time-based units of processing are known as micro-batches.
Procedure:
1. Kafka Setup:
Install Apache Kafka if it's not already installed, then start Zookeeper and Kafka:
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
Create a topic for the micro-batch data:
bin/kafka-topics.sh --create --topic microbatch-topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
Start a console producer to send test data:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic microbatch-topic
Example messages:
Event 1
Event 2
Event 3
2. Create the Spark Streaming Application:
You'll create a Spark Streaming application that consumes data from Kafka and processes it in micro-batches. In this example, we'll process data every 5 seconds.
Add the Maven dependencies to pom.xml:
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>3.4.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
    <version>3.4.0</version>
  </dependency>
</dependencies>
MicroBatchStreaming.java:
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.*;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class MicroBatchStreaming {
    public static void main(String[] args) throws InterruptedException {
        // 1. Spark configuration with a micro-batch interval of 5 seconds
        SparkConf conf = new SparkConf().setAppName("MicroBatchApp");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));
        // 2. Kafka parameters
        String topic = "microbatch-topic";
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("group.id", "microbatch-group");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        // 3. Create the Kafka stream
        JavaInputDStream<ConsumerRecord<String, String>> kafkaStream =
            KafkaUtils.createDirectStream(
                ssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Collections.singleton(topic), kafkaParams)
            );
        // 4. Print the events in each micro-batch
        kafkaStream.foreachRDD(rdd -> {
            System.out.println("Processing batch:");
            rdd.foreach(record -> System.out.println("Event: " + record.value()));
        });
        ssc.start();
        ssc.awaitTermination();
    }
}
3. Build and Run:
mvn clean package
./bin/spark-submit --class MicroBatchStreaming --master local[2] target/your-app-jar-file.jar
4. Send Test Data:
Send test data through the Kafka producer to simulate incoming events:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic microbatch-topic
Example data:
Event 1
Event 2
Event 3
5. Monitor Output:
Once the application starts, Spark will process the incoming events every 5 seconds. You should see the following output for each micro-batch:
Processing batch:
Event: Event 1
Event: Event 2
Event: Event 3
6. Extend the Processing (optional):
You can add more processing logic such as filtering, aggregation, or windowed operations. For example, to count the number of events in each micro-batch:
kafkaStream
    .map(record -> record.value())
    .count()
    .foreachRDD(rdd -> {
        rdd.collect().forEach(count -> {
            System.out.println("Batch size: " + count);
        });
    });
7. Clean Up:
1. Stop Kafka:
bin/kafka-server-stop.sh
2. Stop Zookeeper:
bin/zookeeper-server-stop.sh
3. Stop the streaming application, or stop the context from code:
ssc.stop();
Output
1. Kafka & Spark Setup:
- Zookeeper started on port 2181.
- Kafka Server started on port 9092.
- Created topic: microbatch-topic.
4. Outcomes:
- Processing batch:
Event: Event 1
Event: Event 2
Event: Event 3
5. Clean Up:
- Kafka and Zookeeper stopped.
Result:
You’ve successfully created a micro-batch processing application using Spark Streaming and
Kafka. This application processes data in small, time-based intervals (micro-batches) and can be
extended with complex transformations, aggregations, or output to external systems (e.g.,
databases, HDFS).
Objective:
Build a real-time fraud detection system using Apache Spark Streaming to process incoming data (e.g., financial transactions) and detect anomalies or fraudulent activity.
Procedure:
1. Kafka Setup:
Start Zookeeper and Kafka:
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
Create a topic for transactions:
bin/kafka-topics.sh --create --topic transactions --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
Start a console producer and send sample transactions, for example:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic transactions
accountID:5678,amount:5000.0,location:LA,ip:192.168.1.2,timestamp:1617187462
2. Spark Application Setup:
Maven Dependencies:
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>3.4.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
    <version>3.4.0</version>
  </dependency>
</dependencies>
FraudDetectionStream.java:
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.*;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
public class FraudDetectionStream {
    public static void main(String[] args) throws InterruptedException {
        // Spark Configuration (5-second micro-batch interval)
        SparkConf conf = new SparkConf().setAppName("FraudDetection");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));
        // Kafka Parameters
        String topic = "transactions";
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("group.id", "fraud-detection-group");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        // Create the transaction stream
        JavaInputDStream<ConsumerRecord<String, String>> kafkaStream =
            KafkaUtils.createDirectStream(
                ssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Collections.singleton(topic), kafkaParams)
            );
        // Flag transactions over $2000 (message format: accountID:...,amount:...,location:...,ip:...,timestamp:...)
        kafkaStream.foreachRDD(rdd -> rdd.foreach(record -> {
            String transaction = record.value();
            double amount = Double.parseDouble(transaction.split(",")[1].split(":")[1]);
            if (amount > 2000) {
                System.out.println("Potential Fraud Detected:");
                System.out.println(transaction);
            }
        }));
        ssc.start();
        ssc.awaitTermination();
    }
}
3. Build and Run:
mvn clean package
./bin/spark-submit --class FraudDetectionStream --master local[2] target/your-app-jar-file.jar
4. Monitor Output:
Once the Spark job starts, it will process the incoming transactions. Transactions over $2000 will be flagged as potentially fraudulent.
Example output:
Potential Fraud Detected:
accountID:5678,amount:5000.0,location:LA,ip:192.168.1.2,timestamp:1617187462
5. Enhance the Detection Logic (optional):
You can enhance the detection logic using more sophisticated techniques:
- Statistical Anomaly Detection: Use moving averages or Z-scores to detect abnormal spending patterns.
- Machine Learning Models: Train an anomaly detection model and use Spark Streaming to apply it in real time.
- Time-Series Analysis: Analyze transaction frequency, location, and patterns over time.
For example, a Z-score check (mean and stdDev would be computed from historical transaction amounts):
double mean = ...;   // mean of historical transaction amounts
double stdDev = ...; // standard deviation of historical amounts
double zScore = (amount - mean) / stdDev;
if (Math.abs(zScore) > 3) {
    System.out.println("Anomalous transaction detected: " + transaction);
}
6. Clean Up:
1. Stop Kafka:
bin/kafka-server-stop.sh
2. Stop Zookeeper:
bin/zookeeper-server-stop.sh
3. Stop the streaming application, or stop the context from code:
ssc.stop();
Output
See the example output under step 4 (Monitor Output) above.
Result:
You’ve built a real-time fraud detection system using Apache Spark Streaming and Kafka.
This application processes transaction data in micro-batches and flags potentially fraudulent
transactions based on simple thresholds. For a more robust system, consider using machine
learning, statistical models, or time-series analysis for more sophisticated fraud detection.
Objective:
Build a real-time personalization system using Apache Spark Streaming and Kafka that processes user interaction data (e.g., clicks, views, purchases) to deliver personalized content such as ads and recommendations.
Procedure:
1. Kafka Setup:
Start Zookeeper and Kafka:
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
Create a topic for user interactions:
bin/kafka-topics.sh --create --topic user-interactions --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
Start a console producer and send sample interaction events:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic user-interactions
Example messages:
userID:1234,action:click,category:electronics,productID:5678,timestamp:1617187362
userID:5678,action:view,category:clothing,productID:1234,timestamp:1617187462
2. Spark Application Setup:
Maven Dependencies:
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>3.4.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
    <version>3.4.0</version>
  </dependency>
</dependencies>
RealTimePersonalization.java:
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;
import java.util.*;
public class RealTimePersonalization {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("RealTimePersonalization");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));
        // Kafka Parameters
        String topic = "user-interactions";
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("group.id", "personalization-group");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        JavaInputDStream<ConsumerRecord<String, String>> kafkaStream = KafkaUtils.createDirectStream(
            ssc, LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(Collections.singleton(topic), kafkaParams));
        // Emit a simple recommendation for each interaction (userID, category, productID fields)
        kafkaStream.foreachRDD(rdd -> rdd.foreach(record -> {
            String[] fields = record.value().split(",");
            System.out.println("Personalized Ad/Recommendation for User " + fields[0].split(":")[1]
                + ": Category: " + fields[2].split(":")[1] + ", Product: " + fields[3].split(":")[1]);
        }));
        ssc.start();
        ssc.awaitTermination();
    }
}
3. Build and Run:
mvn clean package
./bin/spark-submit --class RealTimePersonalization --master local[2] target/your-app-jar-file.jar
4. Monitor Output:
Example output:
Personalized Ad/Recommendation for User 1234: Category: electronics, Product: 5678
Personalized Ad/Recommendation for User 5678: Category: clothing, Product: 1234
5. Enhance the Personalization (optional):
- User Profiling: Combine user behavior data (e.g., past purchases, browsing) to create richer profiles.
- Collaborative Filtering: Use algorithms like ALS (Alternating Least Squares) to recommend products based on similar users' interactions.
- Machine Learning: Apply predictive models to tailor ads and recommendations based on user preferences and interactions.
6. Clean Up:
1. Stop Kafka:
bin/kafka-server-stop.sh
2. Stop Zookeeper:
bin/zookeeper-server-stop.sh
3. Stop the streaming application, or stop the context from code:
ssc.stop();
Output
See the example output under step 4 (Monitor Output) above.
Result:
You've created a real-time personalization system using Apache Spark Streaming and
Kafka. This system processes user interaction data (e.g., clicks, views, purchases) to deliver
personalized content such as ads and recommendations. To improve, you can incorporate
machine learning models, collaborative filtering, or advanced user profiling to enhance the
personalization experience for users.