Stream Processing Lab Manual
Install MongoDB
Aim:
To install MongoDB and verify that it is running. MongoDB is a NoSQL database that stores data in a flexible, JSON-like format. Below are the steps to install MongoDB on Ubuntu, macOS, and Windows.
Procedure (Ubuntu):
1. Update the package index:
sudo apt update
2. Install the required dependencies:
sudo apt install -y libcurl4 openssl liblzma5
3. Import the MongoDB public GPG key:
wget -qO - https://www.mongodb.org/static/pgp/server-5.0.asc | sudo apt-key add -
4. Add the MongoDB repository:
echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/5.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-5.0.list
5. Update the package index again and install MongoDB:
sudo apt update
sudo apt install -y mongodb-org
6. Start the MongoDB service:
sudo systemctl start mongod
7. Enable MongoDB to start on boot:
sudo systemctl enable mongod
8. Check that the service is running:
sudo systemctl status mongod
9. Open the MongoDB shell:
mongo
Procedure (macOS, using Homebrew):
1. Install Homebrew if it is not already installed:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
2. Add the MongoDB Homebrew tap:
brew tap mongodb/brew
3. Install MongoDB Community Edition 5.0:
brew install mongodb-community@5.0
4. Start MongoDB as a background service:
brew services start mongodb/brew/mongodb-community
5. Verify that the service is running:
brew services list
6. Open the MongoDB shell:
mongo
Procedure (Windows):
1. Download the MongoDB MSI installer from the official MongoDB download page and run it. During installation, make sure the option to add MongoDB to your system's PATH environment variable is selected.
2. MongoDB should start automatically as a Windows service. If not, start it manually from an elevated Command Prompt:
net start MongoDB
3. Open the MongoDB shell:
mongo
Verify the Installation:
To check if MongoDB is working properly, connect to the MongoDB shell. Type the following command in your terminal or command prompt:
mongo
This will start the MongoDB shell, and you'll see a prompt like this:
>
You can then run commands like show dbs to list the databases, or use test to switch to the test database.
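For example, a quick check in the shell (the demo collection name here is just an illustration):
show dbs
use test
db.demo.insertOne({ message: "hello" })   // create a sample document in a new collection
db.demo.find()                            // should return the document just inserted
exit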
Enable Remote Access (optional):
1. Edit the MongoDB configuration file (mongod.conf).
2. Set bindIp so MongoDB listens on all network interfaces:
net:
  bindIp: 0.0.0.0
3. Restart MongoDB:
Restart the MongoDB service to apply the changes.
On Linux/macOS:
sudo systemctl restart mongod
On Windows, restart the MongoDB service via the Services application or by running net stop MongoDB followed by net start MongoDB.
Enable Access Control (optional):
1. Create an Admin User: Open the MongoDB shell and run the following commands:
use admin
db.createUser({ user: "admin", pwd: "password", roles: [{ role: "userAdminAnyDatabase", db: "admin" }] })
2. Enable Authentication:
Edit the mongod.conf file and enable authorization:
security:
  authorization: "enabled"
3. Restart MongoDB:
Restart MongoDB to apply the changes.
4. Connect as the Admin User:
mongo -u admin -p password --authenticationDatabase admin
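When access control is enabled, applications must supply credentials as well. A minimal sketch for the Node.js driver used later in this manual, assuming the admin user created above (in practice you would create a dedicated, less-privileged user for the application):
const { MongoClient } = require('mongodb');
const uri = 'mongodb://admin:password@localhost:27017/?authSource=admin';
MongoClient.connect(uri)
  .then(client => {
    console.log('Authenticated connection established');
    return client.close();
  })
  .catch(err => console.error('Authentication failed:', err));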
Output:
Here's a simplified output you might see when installing MongoDB on Ubuntu:
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed: ...
To confirm the installed version:
mongo --version
MongoDB shell version v5.0.15
Result:
You've successfully installed MongoDB on your system. MongoDB is now ready for use in
storing, querying, and managing your NoSQL data. You can now integrate it into your
applications or use it directly for data storage in real-time applications like analytics, logging, or
content management.
Aim:
To build a simple CRUD (Create, Read, Update, Delete) application using MongoDB and Node.js. The app will perform basic operations on a MongoDB database to manage user data (name, email, and age).
Procedure:
Step 1: Set Up the Project
Create a project folder and initialize a Node.js project:
mkdir mongo-crud-app
cd mongo-crud-app
npm init -y
Install the required packages:
npm install express mongodb
Step 2: Create Application
Create a file app.js and set up the server and MongoDB connection.
const express = require('express');
const { MongoClient, ObjectId } = require('mongodb');

const app = express();
app.use(express.json());

// Connection settings (local MongoDB instance, database used by this app)
const url = 'mongodb://localhost:27017';
const dbName = 'crudApp';
let db;

// Connect to MongoDB
MongoClient.connect(url, { useNewUrlParser: true, useUnifiedTopology: true })
  .then(client => {
    console.log('Connected to MongoDB');
    db = client.db(dbName);
  })
  .catch(err => console.error('Failed to connect to MongoDB:', err));

// Routes
app.get('/', (req, res) => res.send('MongoDB CRUD Application'));

// Start server
const port = 3000;
app.listen(port, () => console.log(`Server is running on http://localhost:${port}`));
Step 3: Run the Application
node app.js
Step 4: Test the CRUD Endpoints
1. Create User (POST /users): Send a JSON body such as:
{ "name": "John Doe", "email": "john@example.com", "age": 30 }
2. Read Users (GET /users): Retrieve all users.
3. Update User (PUT /users/:id): Send the updated fields, for example:
{ "name": "John Smith", "email": "john.smith@example.com", "age": 31 }
4. Delete User (DELETE /users/:id): Delete a user by their unique _id.
A sketch of these route handlers is shown below.
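A minimal sketch of these handlers, assuming the documents live in a users collection of the crudApp database (add them to app.js before app.listen):
// Create a user
app.post('/users', async (req, res) => {
  const result = await db.collection('users').insertOne(req.body);
  res.status(201).json(result);
});
// Read all users
app.get('/users', async (req, res) => {
  res.json(await db.collection('users').find().toArray());
});
// Update a user by _id
app.put('/users/:id', async (req, res) => {
  const result = await db.collection('users').updateOne(
    { _id: new ObjectId(req.params.id) },
    { $set: req.body }
  );
  res.json(result);
});
// Delete a user by _id
app.delete('/users/:id', async (req, res) => {
  res.json(await db.collection('users').deleteOne({ _id: new ObjectId(req.params.id) }));
});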
Step 5: Verify the Data in the MongoDB Shell
mongo
use crudApp
db.users.find().pretty()
Output
> show dbs
admin   0.000GB
local   0.000GB
> use crudApp
switched to db crudApp
> db.createCollection("users")
{ "ok" : 1 }
> db.users.find()
{
  "_id" : ObjectId("650d6509341234567890abc"),
  "name" : "Alice",
  "age" : 30
}
Result:
This simple application demonstrates how to perform basic CRUD operations with MongoDB
and Node.js. You can extend this by adding validation, authentication, or advanced MongoDB
features like aggregation. This CRUD application can be further built upon to create complex
systems for real-world applications.
Aim:
Once you've set up your CRUD application using MongoDB and Node.js, you may want to
query the system to retrieve, update, or delete data based on specific conditions. In this section,
we'll explore various MongoDB query operations that you can perform on your database.
Procedure:
1. Query All Users
db.users.find().pretty()
This command returns all documents (users) in the users collection, formatted neatly for better readability.
2. Query a User by ID
db.users.find({ _id: ObjectId("user-id-here") }).pretty()
Replace "user-id-here" with the actual _id of the user you want to query. In MongoDB, the _id is typically an ObjectId, which you can generate or retrieve programmatically.
Example:
db.users.find({ _id: ObjectId("613b2a5b9e24f53c6a05689b") }).pretty()
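For instance, in a Node.js script using the mongodb driver installed earlier, an ObjectId can be generated or parsed like this (a small illustrative snippet):
const { ObjectId } = require('mongodb');
const generated = new ObjectId();                        // a brand-new id
const parsed = new ObjectId('613b2a5b9e24f53c6a05689b'); // parse an existing 24-character hex string
console.log(generated.toHexString(), parsed.toHexString());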
3. Query Users by Name
db.users.find({ name: "John Doe" }).pretty()
4. Query Users with Conditions
To retrieve users who meet specific conditions (e.g., age > 25):
db.users.find({ age: { $gt: 25 } }).pretty()
This returns all users whose age field is greater than 25. You can use other comparison operators such as $lt (less than), $gte (greater than or equal to), $lte (less than or equal to), and $ne (not equal to).
5. Query Users with Multiple Conditions
Example:
db.users.find({ age: { $gt: 25 }, name: "John" }).pretty()
This query returns all users who are older than 25 and whose name is John.
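Comparison operators can also be combined on a single field; for example, an age range query (the bounds are arbitrary):
db.users.find({ age: { $gte: 25, $lte: 40 } }).pretty()
This returns users whose age is between 25 and 40, inclusive.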
6. Query Users with a Regular Expression (e.g., Name Starts with "J")
If you want to find users whose name starts with "J", you can use a regular expression in
MongoDB.
db.users.find({ name: { $regex: "^J", $options: "i" } }).pretty()
This query uses the regular expression ^J to match names that start with "J". The $options:
"i" makes the search case-insensitive.
7. Sort Query Results
You can sort the query results by a field. For example, to get users sorted by age in ascending order:
db.users.find().sort({ age: 1 }).pretty()
To sort by age in descending order:
db.users.find().sort({ age: -1 }).pretty()
8. Limit the Number of Results
If you only want a specific number of users returned, you can limit the results using the limit()
method. For example, to retrieve the first 3 users:
db.users.find().limit(3).pretty()
This will return the first 3 users from the users collection.
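limit() is often paired with sort() and skip() for simple pagination; an illustrative example with a page size of 3:
db.users.find().sort({ age: 1 }).skip(3).limit(3).pretty()
This returns the second page of three users when the results are ordered by age.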
9. Aggregation Queries
MongoDB provides the aggregation framework to perform complex queries like grouping, filtering, and transforming data. For example, you can group users by their age:
db.users.aggregate([
  { $group: { _id: "$age", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])
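A pipeline can also filter documents before grouping; for example, to count users older than 25 by age (the threshold is arbitrary):
db.users.aggregate([
  { $match: { age: { $gt: 25 } } },
  { $group: { _id: "$age", count: { $sum: 1 } } }
])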
10. Update a User
To update a user's information (e.g., changing the name of a user with a specific _id):
db.users.updateOne(
  { _id: ObjectId("user-id-here") },
  { $set: { name: "Updated Name" } }
)
This command updates the name field of the user with the provided _id. Use $set to modify specific fields without affecting others.
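To apply the same change to every matching document, updateMany() takes the same arguments; an illustrative example that increments the age of all users named John:
db.users.updateMany(
  { name: "John" },
  { $inc: { age: 1 } }
)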
11. Delete a User
db.users.deleteOne({ _id: ObjectId("user-id-here") })
This command deletes the user with the specified _id. You can also use deleteMany() to delete multiple users that match certain conditions, as shown below.
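An illustrative deleteMany() call (the age threshold is arbitrary):
db.users.deleteMany({ age: { $lt: 18 } })
This removes every user whose age is below 18.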
Output
Query All Users:
[
{
"_id" : ObjectId("650d6509341234567890abc"),
"name" : "Alice",
"age" : 30
},
{
"_id" : ObjectId("650d6509341234567890def"),
"name" : "Bob",
"age" : 25
}
]
Query User by ID:
{
"_id" : ObjectId("650d6509341234567890abc"),
"name" : "Alice",
"age" : 30
}
Result
In this guide, we've demonstrated several MongoDB query techniques to interact with the users
collection. These include basic queries, conditional queries, sorting, limiting, and aggregation
queries. MongoDB's flexibility allows you to perform powerful and efficient data retrieval and
manipulation tasks using these techniques.
Objective:
Create an event stream using Apache Kafka by setting up a producer, consumer, and topic.
Procedure:
1. Download and Extract Kafka:
wget https://downloads.apache.org/kafka/3.4.0/kafka_2.13-3.4.0.tgz
tar -xvzf kafka_2.13-3.4.0.tgz
cd kafka_2.13-3.4.0
2. Start Zookeeper:
bin/zookeeper-server-start.sh config/zookeeper.properties
3. Start Kafka:
bin/kafka-server-start.sh config/server.properties
4. Produce and Consume Events:
Create the topic:
bin/kafka-topics.sh --create --topic event-stream --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
Option 1: Console Producer
bin/kafka-console-producer.sh --topic event-stream --bootstrap-server localhost:9092
Type events (e.g., Event 1: User logged in) and press Enter.
Option 2: Java Producer
EventProducer.java:
import org.apache.kafka.clients.producer.*;
import java.util.Properties;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Send a single event to the event-stream topic and close the producer
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.send(new ProducerRecord<>("event-stream", "key", "Event 1"));
        producer.close();
    }
}
Compile and run the producer (with the Kafka client JARs on the classpath):
javac EventProducer.java
java EventProducer
Consume events:
Option 1: Console Consumer
bin/kafka-console-consumer.sh --topic event-stream --bootstrap-server localhost:9092 --from-beginning
Option 2: Java Consumer
EventConsumer.java:
import org.apache.kafka.clients.consumer.*;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class EventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "group1");
        props.put("auto.offset.reset", "earliest"); // read the topic from the beginning for this demo
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("event-stream"));

        // Keep polling for new events and print their values
        while (true) {
            consumer.poll(Duration.ofSeconds(1)).forEach(record -> System.out.println(record.value()));
        }
    }
}
Compile and run the consumer:
javac EventConsumer.java
java EventConsumer
5. Clean Up:
Stop Kafka:
bin/kafka-server-stop.sh
Stop Zookeeper:
bin/zookeeper-server-stop.sh
Output
... # Download logs
kafka_2.13-3.4.0/
kafka_2.13-3.4.0/LICENSE
kafka_2.13-3.4.0/NOTICE
kafka_2.13-3.4.0/bin/
... # List of extracted files
Result:
You’ve successfully set up a Kafka event stream by creating a producer, consumer, and topic.
You’ve also learned how to send and consume events from Kafka using both the console and
Java APIs.
Objective:
Build a real-time stream processing application using Apache Spark Streaming and Kafka.
Procedure:
1. Kafka Setup:
Start Zookeeper:
bin/zookeeper-server-start.sh config/zookeeper.properties
Start Kafka:
bin/kafka-server-start.sh config/server.properties
Create a topic:
bin/kafka-topics.sh --create --topic stream-topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
2. Spark Application Setup:
Add the Maven dependencies to pom.xml:
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>3.4.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
    <version>3.4.0</version>
  </dependency>
</dependencies>
SparkStreamingKafka.java:
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.*;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SparkStreamingKafka {
    public static void main(String[] args) throws InterruptedException {
        // Spark configuration with a 5-second batch interval
        SparkConf conf = new SparkConf().setAppName("SparkStreamingKafka");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Kafka consumer parameters
        String topic = "stream-topic";
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("group.id", "spark-streaming-group");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);

        // Create a direct stream from the Kafka topic
        JavaInputDStream<ConsumerRecord<String, String>> kafkaStream =
            KafkaUtils.createDirectStream(
                ssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Collections.singleton(topic), kafkaParams)
            );

        // Print every message in each micro-batch
        kafkaStream.foreachRDD(rdd -> {
            rdd.foreach(record -> System.out.println("Message: " + record.value()));
        });

        ssc.start();
        ssc.awaitTermination();
    }
}
3. Build and Run:
Build the application:
mvn clean package
Submit it to Spark:
./bin/spark-submit --class SparkStreamingKafka --master local[2] target/your-app-jar-file.jar
4. Send Test Messages:
Start a console producer and type a few messages:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic stream-topic
Example messages:
Hello, Spark Streaming!
Event: 1
Event: 2
5. Monitor Output:
After running the Spark app, you should see the messages in the console:
Message: Hello, Spark Streaming!
Message: Event: 1
Message: Event: 2
6. Add Transformations (optional):
For example, to keep only messages that contain "Event":
kafkaStream
    .map(record -> record.value())
    .filter(message -> message.contains("Event"))
    .foreachRDD(rdd -> {
        rdd.foreach(message -> System.out.println("Filtered: " + message));
    });
7. Clean Up:
1. Stop Kafka:
bin/kafka-server-stop.sh
2. Stop Zookeeper:
bin/zookeeper-server-stop.sh
Output
1. Kafka & Spark Setup:
- Start Kafka & Zookeeper:
[INFO] Starting Kafka & Zookeeper Server
[INFO] Kafka & Zookeeper Server started on port 9092
7. Clean Up:
- Stop Kafka & Zookeeper:
[INFO] Stopping Kafka & Zookeeper Server
[INFO] Kafka & Zookeeper Server stopped.
Result:
You have successfully built a real-time stream processing application using Apache Spark
Streaming and Kafka. This setup reads messages from a Kafka topic, processes them in real
time, and can be expanded to include more complex transformations or output to various sinks.
6. Build a Micro-batch application
Objective:
Create a micro-batch application using Apache Spark Streaming that processes incoming
data in small batches (windows), performs transformations, and outputs results.
In Spark Streaming, data is processed in small batches at a configurable time interval (e.g., 1 second, 5
seconds). These small time-based units of processing are known as micro-batches.
Procedure:
1. Kafka Setup:
Install Apache Kafka if it's not already installed, then start Zookeeper and Kafka:
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
Create a topic for the micro-batch data:
bin/kafka-topics.sh --create --topic microbatch-topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
Start a console producer to send test data:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic microbatch-topic
Example messages:
Event 1
Event 2
Event 3
2. Create the Spark Streaming Application:
You'll create a Spark Streaming application that consumes data from Kafka and processes it in micro-batches. In this example, we'll process data every 5 seconds.
Add the Maven dependencies to pom.xml:
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>3.4.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
    <version>3.4.0</version>
  </dependency>
</dependencies>
MicroBatchStreaming.java:
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.*;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class MicroBatchStreaming {
    public static void main(String[] args) throws InterruptedException {
        // 1. Spark configuration with a micro-batch interval of 5 seconds
        SparkConf conf = new SparkConf().setAppName("MicroBatchApp");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));
        // 2. Kafka parameters
        String topic = "microbatch-topic";
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("group.id", "microbatch-group");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        // 3. Create the Kafka stream
        JavaInputDStream<ConsumerRecord<String, String>> kafkaStream =
            KafkaUtils.createDirectStream(
                ssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Collections.singleton(topic), kafkaParams)
            );
        // 4. Print the events in each micro-batch
        kafkaStream.foreachRDD(rdd -> {
            System.out.println("Processing batch:");
            rdd.foreach(record -> System.out.println("Event: " + record.value()));
        });
        ssc.start();
        ssc.awaitTermination();
    }
}
3. Build and Run:
mvn clean package
./bin/spark-submit --class MicroBatchStreaming --master local[2] target/your-app-jar-file.jar
4. Send Test Data:
Send test data through the Kafka producer to simulate incoming events:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic microbatch-topic
Example data:
Event 1
Event 2
Event 3
5. Monitor Output:
Once the application starts, Spark will process the incoming events every 5 seconds. You should see the following output for each micro-batch:
Processing batch:
Event: Event 1
Event: Event 2
Event: Event 3
6. Extend the Processing (optional):
You can add more processing logic such as filtering, aggregation, or windowed operations. For example, to count the number of events in each micro-batch:
kafkaStream
    .map(record -> record.value())
    .count()
    .foreachRDD(rdd -> {
        rdd.collect().forEach(count -> {
            System.out.println("Batch size: " + count);
        });
    });
7. Clean Up:
1. Stop Kafka:
bin/kafka-server-stop.sh
2. Stop Zookeeper:
bin/zookeeper-server-stop.sh
3. Stop the streaming application, or stop the context from code:
ssc.stop();
Output
1. Kafka & Spark Setup:
- Zookeeper started on port 2181.
- Kafka Server started on port 9092.
- Created topic: microbatch-topic.
4. Outcomes:
- Processing batch:
Event: Event 1
Event: Event 2
Event: Event 3
5. Clean Up:
- Kafka and Zookeeper stopped.
Result:
You’ve successfully created a micro-batch processing application using Spark Streaming and
Kafka. This application processes data in small, time-based intervals (micro-batches) and can be
extended with complex transformations, aggregations, or output to external systems (e.g.,
databases, HDFS).
Objective:
Build a real-time fraud detection system using Apache Spark Streaming to process incoming data (e.g., financial transactions) and detect anomalies or fraudulent activity.
Procedure:
1. Kafka Setup:
Start Zookeeper and Kafka:
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
Create a topic for transactions:
bin/kafka-topics.sh --create --topic transactions --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
Start a console producer and send sample transactions, for example:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic transactions
accountID:5678,amount:5000.0,location:LA,ip:192.168.1.2,timestamp:1617187462
2. Spark Application Setup:
Maven Dependencies:
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>3.4.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
    <version>3.4.0</version>
  </dependency>
</dependencies>
FraudDetectionStream.java:
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.*;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
public class FraudDetectionStream {
    public static void main(String[] args) throws InterruptedException {
        // Spark Configuration (5-second micro-batch interval)
        SparkConf conf = new SparkConf().setAppName("FraudDetection");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));
        // Kafka Parameters
        String topic = "transactions";
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("group.id", "fraud-detection-group");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        // Create the transaction stream
        JavaInputDStream<ConsumerRecord<String, String>> kafkaStream =
            KafkaUtils.createDirectStream(
                ssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Collections.singleton(topic), kafkaParams)
            );
        // Flag transactions over $2000 (message format: accountID:...,amount:...,location:...,ip:...,timestamp:...)
        kafkaStream.foreachRDD(rdd -> rdd.foreach(record -> {
            String transaction = record.value();
            double amount = Double.parseDouble(transaction.split(",")[1].split(":")[1]);
            if (amount > 2000) {
                System.out.println("Potential Fraud Detected:");
                System.out.println(transaction);
            }
        }));
        ssc.start();
        ssc.awaitTermination();
    }
}
3. Build and Run:
mvn clean package
./bin/spark-submit --class FraudDetectionStream --master local[2] target/your-app-jar-file.jar
4. Monitor Output:
Once the Spark job starts, it will process the incoming transactions. Transactions over $2000 will be flagged as potentially fraudulent.
Example output:
Potential Fraud Detected:
accountID:5678,amount:5000.0,location:LA,ip:192.168.1.2,timestamp:1617187462
5. Enhance the Detection Logic (optional):
You can enhance the detection logic using more sophisticated techniques:
- Statistical Anomaly Detection: Use moving averages or Z-scores to detect abnormal spending patterns.
- Machine Learning Models: Train an anomaly detection model and use Spark Streaming to apply it in real time.
- Time-Series Analysis: Analyze transaction frequency, location, and patterns over time.
For example, a Z-score check (mean and stdDev would be computed from historical transaction amounts):
double mean = ...;   // mean of historical transaction amounts
double stdDev = ...; // standard deviation of historical amounts
double zScore = (amount - mean) / stdDev;
if (Math.abs(zScore) > 3) {
    System.out.println("Anomalous transaction detected: " + transaction);
}
6. Clean Up:
1. Stop Kafka:
bin/kafka-server-stop.sh
2. Stop Zookeeper:
bin/zookeeper-server-stop.sh
3. Stop the streaming application, or stop the context from code:
ssc.stop();
Output
See the example output under step 4 (Monitor Output) above.
Result:
You’ve built a real-time fraud detection system using Apache Spark Streaming and Kafka.
This application processes transaction data in micro-batches and flags potentially fraudulent
transactions based on simple thresholds. For a more robust system, consider using machine
learning, statistical models, or time-series analysis for more sophisticated fraud detection.
Objective:
Build a real-time personalization system using Apache Spark Streaming and Kafka that processes user interaction data (e.g., clicks, views, purchases) to deliver personalized content such as ads and recommendations.
Procedure:
1. Kafka Setup:
Start Zookeeper and Kafka:
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
Create a topic for user interactions:
bin/kafka-topics.sh --create --topic user-interactions --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
Start a console producer and send sample interaction events:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic user-interactions
Example messages:
userID:1234,action:click,category:electronics,productID:5678,timestamp:1617187362
userID:5678,action:view,category:clothing,productID:1234,timestamp:1617187462
2. Spark Application Setup:
Maven Dependencies:
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>3.4.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
    <version>3.4.0</version>
  </dependency>
</dependencies>
RealTimePersonalization.java:
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;
import java.util.*;
public class RealTimePersonalization {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("RealTimePersonalization");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));
        // Kafka Parameters
        String topic = "user-interactions";
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("group.id", "personalization-group");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        JavaInputDStream<ConsumerRecord<String, String>> kafkaStream = KafkaUtils.createDirectStream(
            ssc, LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(Collections.singleton(topic), kafkaParams));
        // Emit a simple recommendation for each interaction (userID, category, productID fields)
        kafkaStream.foreachRDD(rdd -> rdd.foreach(record -> {
            String[] fields = record.value().split(",");
            System.out.println("Personalized Ad/Recommendation for User " + fields[0].split(":")[1]
                + ": Category: " + fields[2].split(":")[1] + ", Product: " + fields[3].split(":")[1]);
        }));
        ssc.start();
        ssc.awaitTermination();
    }
}
3. Build and Run:
mvn clean package
./bin/spark-submit --class RealTimePersonalization --master local[2] target/your-app-jar-file.jar
4. Monitor Output:
Example output:
Personalized Ad/Recommendation for User 1234: Category: electronics, Product: 5678
Personalized Ad/Recommendation for User 5678: Category: clothing, Product: 1234
5. Enhance the Personalization (optional):
- User Profiling: Combine user behavior data (e.g., past purchases, browsing) to create richer profiles.
- Collaborative Filtering: Use algorithms like ALS (Alternating Least Squares) to recommend products based on similar users' interactions.
- Machine Learning: Apply predictive models to tailor ads and recommendations based on user preferences and interactions.
6. Clean Up:
1. Stop Kafka:
bin/kafka-server-stop.sh
2. Stop Zookeeper:
bin/zookeeper-server-stop.sh
3. Stop the streaming application, or stop the context from code:
ssc.stop();
Output
See the example output under step 4 (Monitor Output) above.
Result:
You've created a real-time personalization system using Apache Spark Streaming and
Kafka. This system processes user interaction data (e.g., clicks, views, purchases) to deliver
personalized content such as ads and recommendations. To improve, you can incorporate
machine learning models, collaborative filtering, or advanced user profiling to enhance the
personalization experience for users.