Stream Processing Lab Manual

1. Install MongoDB

Aim:

To install MongoDB, a NoSQL database that stores data in a flexible, JSON-like format, on different platforms (Ubuntu, macOS, and Windows).

1. Install MongoDB on Ubuntu (Linux)

Step 1: Update the Package Database

bash
sudo apt update

Step 2: Install MongoDB Dependencies

Install the necessary dependencies:

bash
sudo apt install -y libcurl4 openssl liblzma5

Step 3: Add MongoDB Repository

To install MongoDB, add the MongoDB repository to your system:

1. Import the MongoDB public key:

bash
wget -qO - https://www.mongodb.org/static/pgp/server-5.0.asc | sudo apt-key add -

2. Add the MongoDB repository to your sources list:

bash
echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/5.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-5.0.list

Step 4: Install MongoDB


Update the local package database again and install MongoDB:

bash
sudo apt update
sudo apt install -y mongodb-org

Step 5: Start MongoDB Service

Start the MongoDB service:

bash
sudo systemctl start mongod

Step 6: Enable MongoDB to Start on Boot

Enable MongoDB to start automatically when the system boots:

bash
sudo systemctl enable mongod

Step 7: Verify MongoDB Installation

Check the status of MongoDB to ensure it's running:

bash
sudo systemctl status mongod

You can also connect to the MongoDB shell by typing:

bash
mongo

2. Install MongoDB on macOS

Step 1: Install Homebrew

If you don't have Homebrew installed, install it by running:

bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Step 2: Tap the MongoDB Formula

Run the following command to tap the official MongoDB formula:

bash
brew tap mongodb/brew

Step 3: Install MongoDB

Now, install MongoDB using Homebrew:

bash
brew install mongodb-community@5.0

Step 4: Start MongoDB

Start MongoDB using Homebrew services:

bash
brew services start mongodb/brew/mongodb-community

Step 5: Verify Installation

To verify the MongoDB service is running:

bash
brew services list

To access the MongoDB shell:

bash
mongo

3. Install MongoDB on Windows

Step 1: Download MongoDB Installer

• Go to the MongoDB Download Center.
• Choose Windows as the operating system and download the MSI installer.

Step 2: Run the Installer

• Launch the downloaded .msi file.
• Follow the installation wizard:
  o Choose Complete setup.
  o Make sure Install MongoDB as a Service is selected.
  o Enable Run MongoDB as a Service and set it to start automatically.

Step 3: Add MongoDB to the PATH

• During installation, make sure the option to add MongoDB to your system's PATH environment variable is selected.

Step 4: Start MongoDB

MongoDB should start automatically as a service. If not, you can manually start it:

1. Open a Command Prompt and run:

bash
net start MongoDB

Step 5: Verify MongoDB Installation

To check if MongoDB is running, open a Command Prompt and type:

bash
mongo

4. Test MongoDB Installation

To check if MongoDB is working properly, you can connect to the MongoDB shell. Type the
following command in your terminal or command prompt:

bash
mongo

This will start the MongoDB shell, and you'll see a prompt like this:

text
>

You can then run commands like show dbs to see the databases or use test to switch to the test database.
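For example, a quick session might look like this (the databases listed and the IDs returned will differ on your machine; the demo collection below is just a placeholder used for illustration):

text
> show dbs
admin   0.000GB
config  0.000GB
local   0.000GB
> use test
switched to db test
> db.demo.insertOne({ status: "ok" })
> db.demo.find()
{ "_id" : ObjectId("..."), "status" : "ok" }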

5. (Optional) Configure MongoDB for Remote Access


If you want to enable remote access to your MongoDB instance (for example, from a different
machine), you will need to:

1. Edit the MongoDB configuration file (mongod.conf):


o On Linux and macOS, this file is typically located at /etc/mongod.conf.
o On Windows, find the MongoDB config file mongod.cfg.

2. Bind MongoDB to All IP Addresses:


o Open the mongod.conf file and find the bindIp field under net.
o Set it to 0.0.0.0 to allow connections from any IP address (be cautious as this opens
up MongoDB to the entire network):

yaml
net:
  bindIp: 0.0.0.0

3. Restart MongoDB:
   o Restart the MongoDB service to apply the changes:
     • On Linux/macOS:

bash
sudo systemctl restart mongod

     • On Windows, restart the MongoDB service via the Services application or using net start MongoDB.

6. (Optional) Secure MongoDB with Authentication

If you want to secure your MongoDB instance:

1. Create an Admin User: Open the MongoDB shell and run the following commands:

bash
use admin
db.createUser({ user: "admin", pwd: "password", roles: [{ role: "userAdminAnyDatabase", db: "admin" }] })

2. Enable Authentication:
   o Edit the mongod.conf file and enable authentication:

yaml
security:
  authorization: "enabled"

3. Restart MongoDB:
   o Restart MongoDB to apply the changes.

Now, MongoDB requires authentication. To connect:

bash
mongo -u admin -p password --authenticationDatabase admin
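Beyond the admin account, an application would normally get its own user whose access is limited to one database; a minimal sketch (the database name and credentials below are placeholders):

bash
use crudApp
db.createUser({ user: "appUser", pwd: "appPassword", roles: [{ role: "readWrite", db: "crudApp" }] })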

Output

Here's a simplified output you might see when installing MongoDB on Ubuntu:

bash
Reading package lists... Done
Building dependency tree
Reading state information
The following NEW packages will be installed:
  mongodb-org mongodb-org-server mongodb-org-shell mongodb-org-mongos mongodb-org-tools
0 upgraded, 5 newly installed, 0 to remove and 0 not upgraded.
Need to get 133 MB of archives.
After this operation, 322 MB of additional disk space will be used.
Get:1 https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/5.0 multiverse mongodb-org-server amd64 5.0.15 [67.0 MB]
Get:2 https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/5.0 multiverse mongodb-org-shell amd64 5.0.15 [28.6 MB]
Get:3 https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/5.0 multiverse mongodb-org-mongos amd64 5.0.15 [18.9 MB]
Get:4 https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/5.0 multiverse mongodb-org-tools amd64 5.0.15 [18.5 MB]
Get:5 https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/5.0 multiverse mongodb-org amd64 5.0.15 [1.0 MB]
Fetched 133 MB in 9s (14.7 MB/s)
Selecting previously unselected package mongodb-org-server.
(Reading database ... 166866 files and directories currently installed.)
Preparing to unpack .../mongodb-org-server_5.0.15_amd64.deb ...
Unpacking mongodb-org-server (5.0.15) ...
[... other packages being unpacked ...]
Setting up mongodb-org-server (5.0.15) ...
Setting up mongodb-org-shell (5.0.15) ...
Setting up mongodb-org-mongos (5.0.15) ...
Setting up mongodb-org-tools (5.0.15) ...
Setting up mongodb-org (5.0.15) ...

After the installation:

bash
mongo --version
MongoDB shell version v5.0.15

Result:

You've successfully installed MongoDB on your system. MongoDB is now ready for use in
storing, querying, and managing your NoSQL data. You can now integrate it into your
applications or use it directly for data storage in real-time applications like analytics, logging, or
content management.

2. Design and Implement a Simple Application Using MongoDB

This guide demonstrates how to build a simple CRUD (Create, Read, Update, Delete)
application using MongoDB and Node.js. The app will perform basic operations on a MongoDB
database to manage user data (name, email, and age).

Step 1: Set Up Node.js Project

1. Create a new directory and initialize a Node.js project:

bash
mkdir mongo-crud-app
cd mongo-crud-app
npm init -y

2. Install necessary dependencies:

bash
npm install express mongodb
Step 2: Create Application

Create a file app.js and set up the server and MongoDB connection.

javascript
const express = require('express');
const { MongoClient, ObjectId } = require('mongodb');
const app = express();

app.use(express.json());

// MongoDB connection URL and database name
const url = 'mongodb://localhost:27017';
const dbName = 'crudApp';
let db;

// Connect to MongoDB
MongoClient.connect(url)
  .then(client => {
    console.log('Connected to MongoDB');
    db = client.db(dbName);
  })
  .catch(err => console.error('Failed to connect to MongoDB:', err));

// Routes
app.get('/', (req, res) => res.send('MongoDB CRUD Application'));

// Create a user
app.post('/users', async (req, res) => {
  try {
    const user = req.body;
    const result = await db.collection('users').insertOne(user);
    // insertOne returns the generated _id (result.ops was removed in driver v4+)
    res.status(201).json({ _id: result.insertedId, ...user });
  } catch (err) {
    res.status(500).json({ error: 'Failed to create user' });
  }
});

// Read all users
app.get('/users', async (req, res) => {
  try {
    const users = await db.collection('users').find().toArray();
    res.status(200).json(users);
  } catch (err) {
    res.status(500).json({ error: 'Failed to fetch users' });
  }
});

// Update a user by id
app.put('/users/:id', async (req, res) => {
  try {
    const { id } = req.params;
    const updatedUser = req.body;
    const result = await db.collection('users').updateOne(
      { _id: new ObjectId(id) },
      { $set: updatedUser }
    );
    if (result.matchedCount > 0) {
      res.status(200).json({ message: 'User updated' });
    } else {
      res.status(404).json({ message: 'User not found' });
    }
  } catch (err) {
    res.status(500).json({ error: 'Failed to update user' });
  }
});

// Delete a user by id
app.delete('/users/:id', async (req, res) => {
  try {
    const { id } = req.params;
    const result = await db.collection('users').deleteOne({ _id: new ObjectId(id) });
    if (result.deletedCount > 0) {
      res.status(200).json({ message: 'User deleted' });
    } else {
      res.status(404).json({ message: 'User not found' });
    }
  } catch (err) {
    res.status(500).json({ error: 'Failed to delete user' });
  }
});

// Start server
const port = 3000;
app.listen(port, () => console.log(`Server is running on http://localhost:${port}`));

Step 3: Test the Application

Start the server by running:

bash
node app.js

Use tools like Postman or cURL to test CRUD operations.

1. Create User (POST /users): Example request body:

json
{ "name": "John Doe", "email": "john@example.com", "age": 30 }

2. Get Users (GET /users): Fetch all users.

3. Update User (PUT /users/:id): Example body for update:

json
{ "name": "John Smith", "email": "john.smith@example.com", "age": 31 }

4. Delete User (DELETE /users/:id): Delete user by their unique _id.
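For reference, the same operations can be exercised from the command line with cURL (assuming the server from Step 3 is running on http://localhost:3000; replace <id> with an _id returned by the API):

bash
# Create a user
curl -X POST http://localhost:3000/users -H "Content-Type: application/json" \
  -d '{ "name": "John Doe", "email": "john@example.com", "age": 30 }'

# Fetch all users
curl http://localhost:3000/users

# Update a user
curl -X PUT http://localhost:3000/users/<id> -H "Content-Type: application/json" \
  -d '{ "name": "John Smith", "age": 31 }'

# Delete a user
curl -X DELETE http://localhost:3000/users/<id>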

Step 4: Verify Data in MongoDB

To verify the data in MongoDB, open the shell and run:

bash
mongo
use crudApp
db.users.find().pretty()

Output

> show dbs
admin  0.000GB
local  0.000GB
> use myDatabase
switched to db myDatabase
> db.createCollection("users")
{ "ok" : 1 }
> db.users.insertOne({ name: "Alice", age: 30 })
{ "acknowledged" : true, "insertedId" : ObjectId("650d6509341234567890abc") }
> db.users.find()
{ "_id" : ObjectId("650d6509341234567890abc"), "name" : "Alice", "age" : 30 }

Result:
This simple application demonstrates how to perform basic CRUD operations with MongoDB
and Node.js. You can extend this by adding validation, authentication, or advanced MongoDB
features like aggregation. This CRUD application can be further built upon to create complex
systems for real-world applications.

3. Query the designed system using MongoDB

Aim:

Once you've set up your CRUD application using MongoDB and Node.js, you may want to
query the system to retrieve, update, or delete data based on specific conditions. In this section,
we'll explore various MongoDB query operations that you can perform on your database.

Procedure:

1. Query All Users

To retrieve all users in the users collection:

MongoDB Shell Command:

js
db.users.find().pretty()

This command will return all documents (users) in the users collection, formatted neatly for better readability.

2. Query a Single User by ID

To find a user by their unique _id field:

MongoDB Shell Command:

js
db.users.find({ _id: ObjectId("user-id-here") }).pretty()

Replace "user-id-here" with the actual _id of the user you want to query. In MongoDB, the _id is typically an ObjectId, which you can generate or retrieve programmatically.

Example with a sample _id:

js
db.users.find({ _id: ObjectId("613b2a5b9e24f53c6a05689b") }).pretty()

3. Query Users by Specific Field (e.g., name)

If you want to retrieve users based on a specific field, like name:

MongoDB Shell Command:

js
db.users.find({ name: "John Doe" }).pretty()

This will return all users whose name is John Doe.

4. Query Users with Conditions (e.g., Age Greater Than 25)

To retrieve users who meet specific conditions (e.g., age > 25):

MongoDB Shell Command:

js
db.users.find({ age: { $gt: 25 } }).pretty()

This will return all users where the age field is greater than 25. You can use various comparison
operators such as $lt (less than), $gte (greater than or equal to), $lte (less than or equal to),
and $ne (not equal to).
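These operators can also be combined in a single condition; for instance, a query for users whose age is between 25 and 40 (inclusive):

js
db.users.find({ age: { $gte: 25, $lte: 40 } }).pretty()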

Example:

• Greater than 30: db.users.find({ age: { $gt: 30 } })
• Less than 40: db.users.find({ age: { $lt: 40 } })

5. Query Users with Multiple Conditions (e.g., Age and Name)


To find users based on multiple criteria (e.g., age > 25 and name = "John"):

MongoDB Shell Command:

js
db.users.find({ age: { $gt: 25 }, name: "John" }).pretty()

This query will return all users who are older than 25 and have the name John.

6. Query Users with a Regular Expression (e.g., Name Starts with "J")

If you want to find users whose name starts with "J", you can use a regular expression in
MongoDB.

MongoDB Shell Command:

js
db.users.find({ name: { $regex: "^J", $options: "i" } }).pretty()

This query uses the regular expression ^J to match names that start with "J". The $options:
"i" makes the search case-insensitive.

7. Sorting Query Results (e.g., Sort by Age)

You can sort the query results by a field. For example, to get users sorted by age in ascending
order:

MongoDB Shell Command:

js
db.users.find().sort({ age: 1 }).pretty()

• { age: 1 } sorts the result by age in ascending order.
• To sort in descending order, use { age: -1 }.

Example for descending order:

js
db.users.find().sort({ age: -1 }).pretty()

8. Limit the Number of Results

If you only want a specific number of users returned, you can limit the results using the limit()
method. For example, to retrieve the first 3 users:

MongoDB Shell Command:

js
db.users.find().limit(3).pretty()

This will return the first 3 users from the users collection.

9. Aggregation Queries (e.g., Group Users by Age)

MongoDB provides the aggregation framework to perform complex queries like grouping,
filtering, and transforming data. For example, you can group users by their age:

MongoDB Aggregation Query:

js
db.users.aggregate([
  { $group: { _id: "$age", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])

This query will:

• Group users by age (using $group).
• Count the number of users in each age group ($sum: 1).
• Sort the results by the count in descending order ($sort).

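As another illustration, an aggregation that computes the average age across all users:

js
db.users.aggregate([
  { $group: { _id: null, avgAge: { $avg: "$age" } } }
])
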
10. Update a User's Information

To update a user's information (e.g., changing the name of a user with a specific _id):

MongoDB Shell Command:

js
db.users.updateOne(
  { _id: ObjectId("user-id-here") },
  { $set: { name: "Updated Name" } }
)

This command will update the name field of the user with the provided _id. Use $set to modify
specific fields without affecting others.

11. Delete a User by ID

To delete a user by their _id:

MongoDB Shell Command:

js
db.users.deleteOne({ _id: ObjectId("user-id-here") })

This command deletes the user with the specified _id. You can also use deleteMany() if you
want to delete multiple users that match certain conditions.
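For instance, a deleteMany call that removes every user younger than 18:

js
db.users.deleteMany({ age: { $lt: 18 } })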

Output
Query All Users:

[
{
"_id" : ObjectId("650d6509341234567890abc"),
"name" : "Alice",
"age" : 30
},
{
"_id" : ObjectId("650d6509341234567890def"),
"name" : "Bob",
"age" : 25
}
]

Query a Single User by ID:

{
"_id" : ObjectId("650d6509341234567890abc"),
"name" : "Alice",
"age" : 30
}
Result

In this guide, we've demonstrated several MongoDB query techniques to interact with the users
collection. These include basic queries, conditional queries, sorting, limiting, and aggregation
queries. MongoDB's flexibility allows you to perform powerful and efficient data retrieval and
manipulation tasks using these techniques.

4. Create an Event Stream with Apache Kafka

Objective:

Create an event stream using Apache Kafka by setting up a producer, consumer, and topic.

Procedure:

1. Setup Apache Kafka:

1. Download and Extract Kafka:

bash
wget https://downloads.apache.org/kafka/3.4.0/kafka_2.13-3.4.0.tgz
tar -xvzf kafka_2.13-3.4.0.tgz
cd kafka_2.13-3.4.0

2. Start ZooKeeper (required unless you run Kafka in KRaft mode):

bash
bin/zookeeper-server-start.sh config/zookeeper.properties

3. Start Kafka:

bash
bin/kafka-server-start.sh config/server.properties

2. Create Kafka Topic:


bash
bin/kafka-topics.sh --create --topic event-stream --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
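To confirm the topic was created, you can describe it:

bash
bin/kafka-topics.sh --describe --topic event-stream --bootstrap-server localhost:9092
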
3. Produce Events to the Topic:

Option 1: Console Producer

bash
bin/kafka-console-producer.sh --topic event-stream --bootstrap-server localhost:9092

• Type events (e.g., Event 1: User logged in) and press Enter.

Option 2: Java Producer

EventProducer.java:

java
import org.apache.kafka.clients.producer.*;
import java.util.Properties;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.send(new ProducerRecord<>("event-stream", "key", "Event 1"));
        producer.close();
    }
}

• Compile & Run (the kafka-clients jar must be on the classpath; the libs/ path below is from the extracted Kafka distribution and may differ on your system):

bash
javac -cp "libs/*" EventProducer.java
java -cp "libs/*:." EventProducer

4. Consume Events from the Topic:

Option 1: Console Consumer

bash
bin/kafka-console-consumer.sh --topic event-stream --bootstrap-server localhost:9092 --from-beginning

Option 2: Java Consumer

EventConsumer.java:

java
import org.apache.kafka.clients.consumer.*;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class EventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "group1");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("event-stream"));

        // Poll in a loop so the consumer keeps receiving new events
        while (true) {
            consumer.poll(Duration.ofSeconds(1))
                    .forEach(record -> System.out.println(record.value()));
        }
    }
}

• Compile & Run (again with the Kafka client jars on the classpath):

bash
javac -cp "libs/*" EventConsumer.java
java -cp "libs/*:." EventConsumer

5. Clean Up:

• Stop Kafka:

bash
bin/kafka-server-stop.sh

• Stop ZooKeeper:

bash
bin/zookeeper-server-stop.sh

Output
... # Download logs
kafka_2.13-3.4.0/
kafka_2.13-3.4.0/LICENSE
kafka_2.13-3.4.0/NOTICE
kafka_2.13-3.4.0/bin/
... # List of extracted files

[2024-11-20 10:21:00,000] INFO Shutting down zookeeper (org.apache.zookeeper.server)


[2024-11-20 10:21:01,000] INFO Zookeeper stopped. (org.apache.zookeeper.server)

Result:

You’ve successfully set up a Kafka event stream by creating a producer, consumer, and topic.
You’ve also learned how to send and consume events from Kafka using both the console and
Java APIs.

5. Create a Real-time Stream Processing Application Using Spark Streaming

Objective:

Build a real-time stream processing application using Apache Spark Streaming and Kafka.

Procedure:

1. Set Up Apache Kafka & Spark:

1. Download and Extract Spark:
   o Download from Apache Spark Downloads.

2. Start Kafka:
   o Start ZooKeeper:

bash
bin/zookeeper-server-start.sh config/zookeeper.properties

   o Start Kafka:

bash
bin/kafka-server-start.sh config/server.properties

3. Create Kafka Topic:

bash
bin/kafka-topics.sh --create --topic stream-topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

2. Spark Streaming Application:

Maven Dependency (if using Maven):

xml
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.12</artifactId>
        <version>3.4.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
        <version>3.4.0</version>
    </dependency>
</dependencies>

Java Example: Spark Streaming with Kafka:

SparkStreamingKafka.java:

java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SparkStreamingKafka {

    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("SparkStreamingKafka");
        // Micro-batch interval of 5 seconds
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        String bootstrapServers = "localhost:9092";
        String groupId = "spark-streaming-group";
        String topic = "stream-topic";

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", bootstrapServers);
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", groupId);
        kafkaParams.put("auto.offset.reset", "latest");
        kafkaParams.put("enable.auto.commit", "false");

        JavaInputDStream<ConsumerRecord<String, String>> kafkaStream =
            KafkaUtils.createDirectStream(
                ssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Collections.singleton(topic), kafkaParams)
            );

        // Print the value of every record in each micro-batch
        kafkaStream.foreachRDD(rdd ->
            rdd.foreach(record -> System.out.println("Message: " + record.value()))
        );

        ssc.start();
        ssc.awaitTermination();
    }
}

3. Build and Run:

1. Compile (if using Maven):

bash
mvn clean package

2. Submit the Spark Application:

bash
./bin/spark-submit --class SparkStreamingKafka --master local[2] target/your-app-jar-file.jar
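If the Kafka integration classes are not bundled into the application jar (for example, when the build produces a thin jar rather than an uber jar), one option is to let spark-submit resolve them from Maven at submit time; a hedged example:

bash
./bin/spark-submit --class SparkStreamingKafka --master local[2] \
  --packages org.apache.spark:spark-streaming-kafka-0-10_2.12:3.4.0 \
  target/your-app-jar-file.jar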

4. Send Data to Kafka:

Use Kafka Producer to send test messages:

bash
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic stream-topic

Example messages:

text
Hello, Spark Streaming!
Event: 1
Event: 2

5. Verify Stream Processing:

After running the Spark app, you should see messages in the console:

text
Message: Hello, Spark Streaming!
Message: Event: 1
Message: Event: 2

6. Enhance Processing (Optional):

Filter messages containing the word "Event":

java
kafkaStream
    .map(record -> record.value())
    .filter(message -> message.contains("Event"))
    .foreachRDD(rdd ->
        rdd.foreach(message -> System.out.println("Filtered: " + message))
    );

7. Clean Up:

1. Stop Kafka:

bash
bin/kafka-server-stop.sh

2. Stop ZooKeeper:

bash
bin/zookeeper-server-stop.sh

Output
1. Kafka & Spark Setup:
- Start Kafka & Zookeeper:
[INFO] Starting Kafka & Zookeeper Server
[INFO] Kafka & Zookeeper Server started on port 9092

2. Spark Streaming Application:


- Spark application started:
[INFO] Spark Streaming initialized with batch duration 5 seconds.
[INFO] Subscribed to topic stream-topic.

3. Send Data to Kafka:


> Hello, Spark Streaming!
> Event: 1
> Event: 2

4. Verify Stream Processing:


Message: Hello, Spark Streaming!
Message: Event: 1
Message: Event: 2

5. Clean Up:
- Stop Kafka & Zookeeper:
[INFO] Stopping Kafka & Zookeeper Server
[INFO] Kafka & Zookeeper Server stopped.

Result:

You have successfully built a real-time stream processing application using Apache Spark
Streaming and Kafka. This setup reads messages from a Kafka topic, processes them in real
time, and can be expanded to include more complex transformations or output to various sinks.

6. Build a Micro-batch Application

Objective:

Create a micro-batch application using Apache Spark Streaming that processes incoming
data in small batches (windows), performs transformations, and outputs results.

In Spark Streaming, data is processed in small batches of a configurable size (e.g., 1 second, 5
seconds). These small time-based units of processing are known as micro-batches.

Steps to Build a Micro-batch Application with Spark Streaming

1. Set Up Apache Kafka (or another data source):

1. Install Apache Kafka if it's not already installed. Start Kafka and Zookeeper:

bash
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties

2. Create a Kafka Topic for testing:

bash
bin/kafka-topics.sh --create --topic microbatch-topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

3. Start Kafka Producer to send data to the topic:

bash
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic microbatch-topic

Example messages:

text
Event 1
Event 2
Event 3

2. Spark Streaming Application Code:

You’ll create a Spark Streaming application that consumes data from Kafka and processes it in
micro-batches. In this example, we'll process data every 5 seconds.

Maven Dependency (if using Maven):

xml
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.12</artifactId>
        <version>3.4.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
        <version>3.4.0</version>
    </dependency>
</dependencies>

Java Code Example: Micro-batch Application:

MicroBatchStreaming.java:

java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class MicroBatchStreaming {

    public static void main(String[] args) throws Exception {

        // 1. Spark configuration (micro-batch interval of 5 seconds)
        SparkConf conf = new SparkConf().setAppName("MicroBatchApp");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // 2. Kafka parameters
        String bootstrapServers = "localhost:9092";
        String groupId = "microbatch-group";
        String topic = "microbatch-topic";

        // 3. Set Kafka consumer properties
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", bootstrapServers);
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", groupId);
        kafkaParams.put("auto.offset.reset", "latest");
        kafkaParams.put("enable.auto.commit", "false");

        // 4. Define the Kafka stream (one RDD per micro-batch)
        JavaInputDStream<ConsumerRecord<String, String>> kafkaStream =
            KafkaUtils.createDirectStream(
                ssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Collections.singleton(topic), kafkaParams)
            );

        // 5. Processing data - for each micro-batch
        kafkaStream.foreachRDD(rdd -> {
            if (!rdd.isEmpty()) {
                System.out.println("Processing batch:");
                rdd.foreach(record -> System.out.println("Event: " + record.value()));
            }
        });

        // 6. Start the stream processing
        ssc.start();
        ssc.awaitTermination();
    }
}

3. Build and Run the Application:

1. Compile the Java application (if using Maven or SBT):

bash
mvn clean package

2. Submit the Spark application:

bash
./bin/spark-submit --class MicroBatchStreaming --master local[2] target/your-app-jar-file.jar

4. Send Data to Kafka:

Send test data through the Kafka producer to simulate incoming events:

bash
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic microbatch-topic

Example data:

text
Event 1
Event 2
Event 3

5. Monitor the Output:

Once the application starts, Spark will process the incoming events every 5 seconds. You should
see the following output for each micro-batch:

text
Processing batch:
Event: Event 1
Event: Event 2
Event: Event 3

6. Enhancing the Application (Optional):

You can add more processing logic such as filtering, aggregation, or windowed operations on the stream before the foreachRDD block (a windowed example is shown after the count example below). For example, to count the number of events in each micro-batch:

java
kafkaStream
    .map(record -> record.value())
    .count()
    .foreachRDD(rdd ->
        rdd.collect().forEach(count -> System.out.println("Batch size: " + count))
    );

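Windowed operations work the same way; for example, a sketch that counts the events seen in a sliding 30-second window, evaluated every 10 seconds (both durations must be multiples of the 5-second batch interval):

java
kafkaStream
    .map(record -> record.value())
    .window(Durations.seconds(30), Durations.seconds(10))
    .count()
    .foreachRDD(rdd ->
        rdd.collect().forEach(count -> System.out.println("Events in last 30s: " + count))
    );
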
7. Clean Up:

1. Stop Kafka:

bash
bin/kafka-server-stop.sh

2. Stop ZooKeeper:

bash
bin/zookeeper-server-stop.sh

3. Stop Spark Streaming (gracefully terminate the Spark job, e.g., with Ctrl+C, or by calling the following in application code):

java
ssc.stop();

Output
1. Kafka & Spark Setup:
- Zookeeper started on port 2181.
- Kafka Server started on port 9092.
- Created topic: microbatch-topic.

2. Spark Streaming Application:


- Spark initialized with 5-second micro-batch interval.
- Subscribed to topic: microbatch-topic.

3. Send Data to Kafka:


- Kafka Producer Messages:
> Event 1
> Event 2
> Event 3

4. Outcomes:
- Processing batch:
Event: Event 1
Event: Event 2
Event: Event 3

5. Clean Up:
- Kafka and Zookeeper stopped.
Result:

You’ve successfully created a micro-batch processing application using Spark Streaming and
Kafka. This application processes data in small, time-based intervals (micro-batches) and can be
extended with complex transformations, aggregations, or output to external systems (e.g.,
databases, HDFS).

7. Real-time Fraud and Anomaly Detection

Objective:

Build a real-time fraud detection system using Apache Spark Streaming to process
incoming data (e.g., financial transactions) and detect anomalies or fraudulent activity.

Steps to Implement Real-Time Fraud Detection

1. Set Up Kafka (or another data source):

1. Start Kafka and Zookeeper:

bash
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties

2. Create Kafka Topic for transactions:

bash
bin/kafka-topics.sh --create --topic transactions --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

3. Kafka Producer to send transaction data:

bash
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic transactions

Example transaction data:

text
accountID:1234,amount:1000.0,location:NYC,ip:192.168.1.1,timestamp:1617187362
accountID:5678,amount:5000.0,location:LA,ip:192.168.1.2,timestamp:1617187462

2. Build Spark Streaming Application:

Maven Dependencies:

xml
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.12</artifactId>
        <version>3.4.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
        <version>3.4.0</version>
    </dependency>
</dependencies>

Java Code: Real-time Fraud Detection:

FraudDetectionStream.java:

java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class FraudDetectionStream {

    public static void main(String[] args) throws Exception {

        // Spark configuration (5-second micro-batch interval)
        SparkConf conf = new SparkConf().setAppName("FraudDetection");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Kafka parameters
        String bootstrapServers = "localhost:9092";
        String groupId = "fraud-detection-group";
        String topic = "transactions";

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", bootstrapServers);
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", groupId);
        kafkaParams.put("auto.offset.reset", "latest");
        kafkaParams.put("enable.auto.commit", "false");

        // Define the Kafka stream
        JavaInputDStream<ConsumerRecord<String, String>> kafkaStream =
            KafkaUtils.createDirectStream(
                ssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Collections.singleton(topic), kafkaParams)
            );

        // Process each micro-batch
        kafkaStream.foreachRDD(rdd -> {
            if (!rdd.isEmpty()) {
                rdd.foreach(record -> {
                    String transaction = record.value();
                    String[] fields = transaction.split(",");
                    // fields[1] looks like "amount:1000.0"
                    double amount = Double.parseDouble(fields[1].split(":")[1]);

                    // Example fraud detection rule: flag transactions over $2000
                    if (amount > 2000) {
                        System.out.println("Potential Fraud Detected: " + transaction);
                    }
                });
            }
        });

        // Start the streaming context
        ssc.start();
        ssc.awaitTermination();
    }
}

3. Build and Run:

1. Compile the Java application (if using Maven):

bash
mvn clean package

2. Submit Spark Application:

bash
./bin/spark-submit --class FraudDetectionStream --master local[2] target/your-app-jar-file.jar

4. Monitor Output:

Once the Spark job starts, it will process the incoming transactions. Transactions over $2000 will
be flagged as potentially fraudulent.

Example output:

text
Potential Fraud Detected: accountID:5678,amount:5000.0,location:LA,ip:192.168.1.2,timestamp:1617187462

5. Enhance Fraud Detection Logic:

You can enhance the detection logic using more sophisticated techniques:

• Statistical Anomaly Detection: Use moving averages or Z-scores for detecting abnormal spending patterns.
• Machine Learning Models: Train an anomaly detection model and use Spark Streaming to apply it in real time.
• Time-Series Analysis: Analyze transaction frequency, location, and patterns over time.

Example: Using a Z-score for anomaly detection:

java
double mean = ...;   // Calculate mean of transactions
double stdDev = ...; // Calculate standard deviation
double zScore = (amount - mean) / stdDev;
if (Math.abs(zScore) > 3) {
    System.out.println("Anomalous transaction detected: " + transaction);
}
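A minimal sketch of how the mean and standard deviation could be computed per micro-batch with Spark's built-in statistics (an illustration only; a production system would maintain running statistics across batches or per account, and the imports noted in the comment are in addition to those in the main class):

java
// Requires: import org.apache.spark.api.java.JavaDoubleRDD;
//           import org.apache.spark.util.StatCounter;
kafkaStream.foreachRDD(rdd -> {
    if (!rdd.isEmpty()) {
        // Extract the numeric amount from each "accountID:...,amount:..." record
        JavaDoubleRDD amounts = rdd.mapToDouble(record ->
            Double.parseDouble(record.value().split(",")[1].split(":")[1]));

        StatCounter stats = amounts.stats(); // count, mean, stdev for this batch
        System.out.println("Batch mean=" + stats.mean() + ", stdDev=" + stats.stdev());
    }
});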

6. Clean Up:

1. Stop Kafka:

bash
bin/kafka-server-stop.sh

2. Stop ZooKeeper:

bash
bin/zookeeper-server-stop.sh

3. Stop Spark Streaming (terminate the job with Ctrl+C, or call the following in application code):

java
ssc.stop();
Output

Result:

You’ve built a real-time fraud detection system using Apache Spark Streaming and Kafka.
This application processes transaction data in micro-batches and flags potentially fraudulent
transactions based on simple thresholds. For a more robust system, consider using machine
learning, statistical models, or time-series analysis for more sophisticated fraud detection.

8. Real-time Personalization, Marketing, and Advertising

Objective:

Build a real-time personalization system using Apache Spark Streaming to tailor marketing and advertising content based on user behavior and interaction data.

Steps to Implement the System:

1. Set Up Kafka (or another data source):

1. Start Kafka and Zookeeper:

bash
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties

2. Create Kafka Topic for user interactions:

bash
bin/kafka-topics.sh --create --topic user-interactions --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

3. Kafka Producer to simulate user interactions:

bash
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic user-interactions

Example interaction data:

text
userID:1234,action:click,category:electronics,productID:5678,timestamp:1617187362
userID:5678,action:view,category:clothing,productID:1234,timestamp:1617187462

2. Build Spark Streaming Application:

Maven Dependency:

Include dependencies for Spark Streaming and Kafka integration:

xml
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.12</artifactId>
        <version>3.4.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
        <version>3.4.0</version>
    </dependency>
</dependencies>

Java Code: Real-time Personalization:

RealTimePersonalization.java:

java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class RealTimePersonalization {

    public static void main(String[] args) throws Exception {
        // Spark configuration (5-second micro-batch interval)
        SparkConf conf = new SparkConf().setAppName("RealTimePersonalization");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Kafka parameters
        String bootstrapServers = "localhost:9092";
        String groupId = "personalization-group";
        String topic = "user-interactions";

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", bootstrapServers);
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", groupId);
        kafkaParams.put("auto.offset.reset", "latest");
        kafkaParams.put("enable.auto.commit", "false");

        // Define the Kafka stream
        JavaInputDStream<ConsumerRecord<String, String>> kafkaStream =
            KafkaUtils.createDirectStream(
                ssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Collections.singleton(topic), kafkaParams)
            );

        // Process each micro-batch and personalize content
        kafkaStream.foreachRDD(rdd -> {
            if (!rdd.isEmpty()) {
                rdd.foreach(record -> {
                    // Each record looks like "userID:1234,action:click,category:electronics,productID:5678,timestamp:..."
                    String interaction = record.value();
                    String[] fields = interaction.split(",");
                    String userID = fields[0].split(":")[1];
                    String action = fields[1].split(":")[1];
                    String category = fields[2].split(":")[1];
                    String productID = fields[3].split(":")[1];

                    // Example personalization: show relevant ads or recommendations
                    if (action.equals("click") || action.equals("view")) {
                        System.out.println("Personalized Ad/Recommendation for User " + userID
                                + ": Category: " + category + ", Product: " + productID);
                    }
                });
            }
        });

        // Start the streaming context
        ssc.start();
        ssc.awaitTermination();
    }
}

3. Build and Run:

1. Compile the Java application:

bash
mvn clean package

2. Submit Spark Application:

bash
./bin/spark-submit --class RealTimePersonalization --master local[2] target/your-app-jar-file.jar

4. Monitor and Validate Output:


Once the Spark job is running, it will process incoming user interactions and personalize content.

Example output:

text
Personalized Ad/Recommendation for User 1234: Category: electronics, Product: 5678
Personalized Ad/Recommendation for User 5678: Category: clothing, Product: 1234

5. Enhance the Personalization Engine:

• User Profiling: Combine user behavior data (e.g., past purchases, browsing) to create richer profiles.
• Collaborative Filtering: Use algorithms like ALS (Alternating Least Squares) to recommend products based on similar users' interactions (see the sketch after this list).
• Machine Learning: Apply predictive models to tailor ads and recommendations based on user preferences and interactions.
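As a sketch of the collaborative-filtering idea mentioned above, Spark MLlib's ALS can be trained on (user, product, rating) triples. This assumes the interaction stream has already been turned into a JavaRDD<Rating> (for example, one implicit rating per click or view) and that the spark-mllib_2.12 dependency has been added; names and parameters below are illustrative:

java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

public class RecommenderTraining {
    // ratings: built elsewhere, e.g. new Rating(userId, productId, 1.0) per click/view
    public static MatrixFactorizationModel train(JavaRDD<Rating> ratings) {
        int rank = 10;        // number of latent features
        int iterations = 10;  // ALS iterations
        double lambda = 0.01; // regularization
        MatrixFactorizationModel model = ALS.train(ratings.rdd(), rank, iterations, lambda);

        // Example use: top 3 product recommendations for user 1234
        // Rating[] recs = model.recommendProducts(1234, 3);
        return model;
    }
}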

6. Clean Up:

1. Stop Kafka:

bash
bin/kafka-server-stop.sh

2. Stop ZooKeeper:

bash
bin/zookeeper-server-stop.sh

3. Stop Spark Streaming (terminate the job with Ctrl+C, or call the following in application code):

java
ssc.stop();

Output
Result:

You've created a real-time personalization system using Apache Spark Streaming and
Kafka. This system processes user interaction data (e.g., clicks, views, purchases) to deliver
personalized content such as ads and recommendations. To improve, you can incorporate
machine learning models, collaborative filtering, or advanced user profiling to enhance the
personalization experience for users.
