B43 BDA Exp7

Uploaded by

Nikhil Aher
Copyright
© © All Rights Reserved
LAB MANUAL

PART A
(PART A : TO BE REFERRED BY STUDENTS)

Experiment No. 07
A.1 Aim:
To implement the DGIM algorithm using Java/Python.

A.2 Prerequisite
Java setup

A.3 Outcome
Students will be able to interpret business models and scientific computing paradigms, and
apply software tools for big data analytics.

A.4 Theory:

MINING DATA STREAMS

Data Stream Mining is the process of extracting knowledge structures from continuous, rapid
data records.
A data stream is an ordered sequence of instances that in many applications of data stream
mining can be read only once or a small number of times using limited computing and
storage capabilities.

TYPES OF QUERIES

Ad-hoc query: You ask a query and get an immediate response. E.g.: What is the
maximum value seen so far in the stream S?

Standing query: You register a query with the system, in effect saying “Any time you have an
answer to this query, send me the response”. Here you do not get the answer immediately.

Now suppose we have a window of length N (say N = 24) on a binary stream. We want
at all times to be able to answer a query of the form “How many 1’s are there in the last K
bits?” for K <= N.
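For comparison, the query can be answered exactly by storing the whole window, at a cost of N
bits of memory. The following is a minimal sketch of that baseline; the class name
`ExactWindow` and its methods are illustrative assumptions, not part of the lab program.

```python
from collections import deque


class ExactWindow:
    """Keeps the last n bits of a stream and counts 1's exactly (hypothetical helper)."""

    def __init__(self, n):
        self.bits = deque(maxlen=n)  # the oldest bit drops out automatically

    def add(self, bit):
        self.bits.append(bit)

    def count_ones(self, k):
        """Number of 1's among the last k bits, for k <= n."""
        if k > self.bits.maxlen:
            raise ValueError("k must not exceed the window length")
        recent = list(self.bits)[-k:] if k > 0 else []
        return sum(recent)


window = ExactWindow(24)
for b in [1, 0, 1, 1, 0, 0, 1]:
    window.add(b)
print(window.count_ones(4))  # last four bits are 1, 0, 0, 1 -> prints 2
```

This baseline needs memory proportional to N, which is exactly the cost an approximate method tries to avoid.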

This is where the DGIM algorithm comes into the picture.

COUNTING THE NUMBER OF 1’s IN THE DATA STREAM


DGIM algorithm (Datar-Gionis-Indyk-Motwani algorithm)

Designed to estimate the number of 1’s in a sliding window over a bit stream. The algorithm
uses O(log² N) bits to represent a window of N bits, and allows us to estimate the number of
1’s in the window with an error of no more than 50%.

So the estimate is always within 50% of the true count.
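The two figures can be checked with a short calculation. This is a sketch under the usual
DGIM assumptions, not all of which are stated in this manual: timestamps are stored modulo N,
there are at most two buckets of any size and at least one bucket of every smaller size, and
the estimate counts only half of the oldest bucket, whose size is written 2^j here.

```latex
\begin{align*}
\text{bits per bucket} &= \underbrace{\log_2 N}_{\text{timestamp mod } N}
  + \underbrace{\log_2 \log_2 N}_{\text{exponent } j \text{ of size } 2^j} = O(\log N) \\
\text{number of buckets} &\le 2\,(\log_2 N + 1) = O(\log N) \\
\text{total space} &= O(\log N) \cdot O(\log N) = O(\log^2 N) \text{ bits} \\[4pt]
\text{true count } c &\ge 1 + \left(2^{j-1} + 2^{j-2} + \dots + 1\right) = 2^{j} \\
|\text{estimate} - c| &\le 2^{j-1}
  \quad\Longrightarrow\quad \frac{|\text{estimate} - c|}{c} \le \frac{2^{j-1}}{2^{j}} = 50\%
\end{align*}
```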

In the DGIM algorithm, each arriving bit has a timestamp giving the position at which it
arrives: the first bit has timestamp 1, the second bit has timestamp 2, and so on. Since only
positions inside the window need to be distinguished, timestamps can be stored modulo the
window size N (the window size is usually taken as a multiple of 2). The window is divided
into buckets consisting of 1’s and 0’s.

RULES FOR FORMING THE BUCKETS:

1. The right end of a bucket should always be a 1 (if it would start with a 0, that 0 is
neglected). E.g. 1001011 → a bucket of size 4, having four 1’s and starting with 1 on its
right end.

2. Every bucket should have at least one 1, else no bucket can be formed.

3. All bucket sizes (the number of 1’s in a bucket) should be powers of 2.

4. The buckets cannot decrease in size as we move to the left (sizes are in non-decreasing
order towards the left).
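Rules 2 to 4 can be checked mechanically on a list of bucket sizes. The sketch below assumes
the sizes are listed from the newest (rightmost) bucket to the oldest; the function names are
illustrative only.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... and False for everything else."""
    return n >= 1 and (n & (n - 1)) == 0


def buckets_are_valid(sizes_right_to_left):
    """Check rules 2-4 on bucket sizes listed from the newest bucket to the oldest."""
    previous = 0
    for size in sizes_right_to_left:
        if not is_power_of_two(size):  # rules 2 and 3: at least one 1, size a power of 2
            return False
        if size < previous:            # rule 4: sizes never decrease towards the left
            return False
        previous = size
    return True


print(buckets_are_valid([1, 1, 2, 4, 4, 8]))  # True: sizes grow towards the left
print(buckets_are_valid([1, 2, 3]))           # False: 3 is not a power of 2
print(buckets_are_valid([2, 1, 4]))           # False: size drops when moving left
```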

Let us take an example to understand the algorithm.

Estimating the number of 1’s and counting the buckets in the given data stream.
This picture shows how we can form the buckets based on the number of ones by following the
rules.

In the given data stream, let us assume the new bit arrives from the right.

When the new bit = 0: after the new bit (0) arrives with timestamp 101, there is no change in
the buckets.

When the new bit = 1: we need to make changes.
- Create a new bucket with the current timestamp and size 1.

If there was only one bucket of size 1, then nothing more needs to be done. However, if
there are now three buckets of size 1 (the buckets with timestamps 100, 102 and 103 in the
second step in the picture), we fix the problem by combining the leftmost (earliest) two
buckets of size 1 (purple box).

To combine any two adjacent buckets of the same size, replace them by one bucket of twice
the size. The timestamp of the new bucket is the timestamp of the rightmost of the two
buckets.

Now, sometimes combining two buckets of size 1 may create a third bucket of size 2. If so,
we combine the leftmost two buckets of size 2 into a bucket of size 4. This process may
ripple through the bucket sizes.
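The update and merge steps described above can be sketched as a single function. This is an
illustrative sketch, not the lab program: it assumes buckets are kept oldest first as
`(timestamp_of_rightmost_1, size)` tuples, and the name `add_bit` is hypothetical.

```python
def add_bit(buckets, timestamp, bit):
    """Update a bucket list (oldest first) for one arriving bit.

    Each bucket is a (timestamp_of_rightmost_1, size) tuple.
    """
    if bit == 0:
        return buckets  # a 0 never changes the buckets
    buckets.append((timestamp, 1))
    i = len(buckets) - 1
    # While three buckets share a size, merge the two oldest of them.
    # A merge may create a third bucket of the doubled size, so the check ripples left.
    while i >= 2 and buckets[i][1] == buckets[i - 2][1]:
        newer_timestamp = buckets[i - 1][0]  # rightmost of the two merged buckets
        merged = (newer_timestamp, buckets[i - 1][1] * 2)
        buckets[i - 2:i] = [merged]          # two buckets become one of twice the size
        i -= 2                               # the merged bucket now sits at index i - 2
    return buckets


buckets = []
for t in range(1, 8):  # seven consecutive 1's with timestamps 1..7
    add_bit(buckets, t, 1)
print(buckets)  # [(4, 4), (6, 2), (7, 1)]
```

After seven 1’s the sizes are 4, 2 and 1, which still add up to 7, and the seventh bit
triggered two merges in a row (size 1, then size 2).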

How long can you continue doing this?

You can continue as long as (current timestamp - timestamp of the leftmost bucket in the
window) < N (N = 24 here). E.g. 103 - 87 = 16 < 24, so we continue; if the difference is
greater than or equal to N, we stop, since that bucket has left the window.
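The window test above can be sketched as a small helper that discards buckets once they are N
or more positions old. The function name and the bucket sizes in the example are made up for
illustration; only the timestamps 87 and 103 and N = 24 come from the text.

```python
def drop_expired(buckets, current_time, n):
    """Remove buckets (oldest first) whose timestamp is n or more positions old.

    Each bucket is a (timestamp_of_rightmost_1, size) tuple.
    """
    while buckets and current_time - buckets[0][0] >= n:
        buckets.pop(0)  # the oldest bucket lies entirely outside the window
    return buckets


# 103 - 87 = 16 < 24, so the oldest bucket is kept.
print(drop_expired([(87, 4), (95, 2), (103, 1)], 103, 24))
# 111 - 87 = 24, which is not < 24, so the oldest bucket is dropped.
print(drop_expired([(87, 4), (95, 2), (103, 1)], 111, 24))
```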

Finally the answer to the query.

How many 1’s are there in the last 20 bits?

Counting the sizes of the buckets in the last 20 bits, we say, there are 11 ones.
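A query of this form can be sketched as follows. The sketch uses the standard DGIM estimation
rule, which this manual does not spell out: buckets newer than the oldest one in range count
in full, while the oldest bucket counts for only half its size because it may extend past the
boundary. The bucket values below are invented for illustration and are not the ones in the
picture.

```python
def estimate_ones(buckets, current_time, k):
    """Estimate the number of 1's among the last k bits.

    buckets: list of (timestamp_of_rightmost_1, size) tuples, oldest first.
    """
    cutoff = current_time - k  # timestamps <= cutoff are outside the last k bits
    in_range = [size for ts, size in buckets if ts > cutoff]
    if not in_range:
        return 0
    oldest = in_range[0]
    # Half of the oldest bucket, rounded up so that a size-1 bucket
    # (whose single 1 is known to be in range) still counts as 1.
    return sum(in_range[1:]) + (oldest + 1) // 2


buckets = [(5, 4), (8, 2), (11, 2), (12, 1), (14, 1)]  # hypothetical state at time 14
print(estimate_ones(buckets, 14, 6))   # buckets at 11, 12, 14 -> 1 + 1 + 1 = 3
print(estimate_ones(buckets, 14, 14))  # all buckets -> 2 + 2 + 1 + 1 + 2 = 8
```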
PART B
(PART B: TO BE COMPLETED BY STUDENTS)
Roll. No.: B43 Name: Nikhil Aher
Class: Fourth Year (B) Batch: B3
Date of Experiment: 05/09/24 Date of Submission: 12/09/24
Grade:

B.1 DGIM algorithm: Write a program in Java by considering any stream to implement the
DGIM algorithm.

# class Bucket stores the number of 1's as its size and the rightmost 1 as its timestamp
class Bucket:
    def __init__(self, size, time_stamp):
        self.size = size
        self.time_stamp = time_stamp

def input_bin_stream(Input):
    binary_file = open("Binary_Input.txt", "w")  # this file stores the 0/1 stream
    bin_stream = []

    # converting the letters to a 0/1 stream: even character code -> 0, odd -> 1
    for i in Input:
        if(ord(i) in range(65, 91) or ord(i) in range(97, 123)):
            if(ord(i) % 2 == 0):
                bin_stream.append(0)
            else:
                bin_stream.append(1)
    # storing the binary stream into the Binary_Input file, 32 bits per line
    count = 0
    for x in bin_stream:
        binary_file.write("%i" % x)
        count = count + 1
        if(count == 32):
            binary_file.write("\n")
            count = 0
    binary_file.close()
    return bin_stream

def intial_buckets(bin_stream):
    output_file = open("Final_Output.txt", "w")
    a_c_file = open("Actual_count.txt", "w")
    bucketList = []

    # Counting the no of 1's for first 32 bits (accurate count)
    oc = 0  # one count since the last bucket was created
    c = 0   # running count of 1's
    n = 0   # number of bits read so far
    bucket_counter = 0
    for x in bin_stream:
        if(x == 1):
            c = c + 1
            oc = oc + 1
            # these thresholds yield initial buckets of sizes 8, 4, 4, 2, 1
            if(c == 8 or c == 12 or c == 16 or c == 18 or c == 19):
                bucketList.append(Bucket(oc, n + 1))  # creating initial buckets for first 32 bits
                oc = 0
                bucket_counter = bucket_counter + 1
            output_file.write("%i " % c)
            a_c_file.write("%i " % c)
        else:
            output_file.write("%i " % c)
            a_c_file.write("%i " % c)

        n = n + 1
        if(n == 32):
            output_file.write("\n")
            a_c_file.write("\n")
            break
    merge_and_estimate(bucketList, bin_stream, bucket_counter, output_file)
    actual_count(bin_stream, a_c_file)
    # close both files so that everything written is flushed to disk
    output_file.close()
    a_c_file.close()

# counts the actual no of 1's for every bit entering into the last 32 bits
def actual_count(bin_stream, a_c_file):
    ac = 0
    d = 0
    j = 1
    for x in range(32, len(bin_stream)):
        for y in range(j, x + 1):
            if(bin_stream[y] == 1):
                ac = ac + 1
        a_c_file.write("%i " % ac)
        ac = 0
        d = d + 1
        j = j + 1
        if(d == 32):
            a_c_file.write("\n")
            d = 0

# counts no of 1's for every bit entering into the last 32 bit stream using buckets
def merge_and_estimate(bucketList, bin_stream, bucket_counter, output_file):
    z = 32  # sliding window size
    d = 0
    for x in range(32, len(bin_stream)):
        total = 0
        if(bin_stream[x] == 1):
            bucketList.append(Bucket(1, x + 1))
            bucket_counter = bucket_counter + 1
            bc = bucket_counter
            while (bc != 0):
                if(bc - 3 >= 0):
                    if(bucketList[bc - 3].size == bucketList[bc - 1].size):  # checks if the size appears 3rd time
                        # merge the two oldest: keep the newer timestamp, double the size
                        size = bucketList[bc - 2].size
                        bucketList[bc - 2].size = size * 2
                        del bucketList[bc - 3]
                        bucket_counter = bucket_counter - 1
                bc = bc - 1
        # estimate for every new bit entering the last 32 bit stream
        l = (x + 1) - z  # timestamps <= l have left the window
        oldest = 0       # size of the oldest bucket that is still in the window
        b_s = bucket_counter
        while (b_s > 0):
            k = bucketList[b_s - 1].time_stamp
            if(k > l):
                total += bucketList[b_s - 1].size
                oldest = bucketList[b_s - 1].size
            else:
                break  # this bucket and all older ones lie outside the window
            b_s = b_s - 1
        total -= int(oldest / 2)  # only half of the oldest in-window bucket is counted
        d = d + 1
        output_file.write("%i " % total)
        if(d == 32):
            output_file.write("\n")
            d = 0

text = """In the 1990’s “data mining” was an exciting and popular new concept. Around 2010,
people instead started to speak of “big data.” Today, the popular term is “data science.”
However, during all this time, the concept remained the same: use the most powerful
hardware, the most powerful programming systems, and the most efficient algorithms to
solve problems in science, commerce, healthcare, government, the humanities, and many
other fields of human endeavor. To many, data mining is the process of creating a model
from data, often by the process of machine learning, which we mention in Section 1.1.3 and
discuss more fully in Chapter 12. However, more generally, the objective of data mining is
an algorithm. For instance, we discuss locality-sensitive hashing in Chapter 3 and a number
of stream-mining algorithms in Chapter 4, none of which involve a model. Yet in many
important applications, the hard part is creating the model, and once the model is available,
the algorithm to use the model is straightforward. Consider the problem of detecting emails
that are phishing attacks. The most common approach is to build a model of phishing emails,
perhaps by examining emails that people have recently reported as"""

bin_stream = input_bin_stream(text)
intial_buckets(bin_stream)
B.2 Input and Output:
Input:
Output:

B.3 Observations and learning:


The DGIM algorithm is designed for efficient approximate counting of events
over a sliding window in data streams, using logarithmic bucketing to reduce memory usage
while maintaining reasonable accuracy. It balances memory constraints with an acceptable
error margin, making it ideal for real-time analytics, such as monitoring traffic or keyword
trends in social media. While highly efficient, DGIM's accuracy may be challenged by bursty
data patterns, requiring recalibration. Despite these limitations, it remains a key tool for
stream processing in applications that demand fast, scalable insights.

B.4 Conclusion:
In conclusion, the DGIM algorithm provides an effective solution for
approximate counting in data streams, offering a memory-efficient way to handle real-time
data over sliding windows. Its use of logarithmic bucketing strikes a balance between
accuracy and resource constraints, making it ideal for high-frequency applications like
network monitoring or social media analysis. However, its performance can be affected by
data spikes, and periodic recalibration may be needed to maintain accuracy. Overall, DGIM
remains a powerful tool for scalable, real-time data processing where precision can be traded
for efficiency.
