B43 BDA Exp7
PART A
(PART A : TO BE REFERRED BY STUDENTS)
Experiment No. 07
A.1 Aim:
To implement the DGIM algorithm using Java/Python.
A.2 Prerequisite:
Java setup
A.3 Outcome:
Students will be able to interpret business models and scientific computing paradigms, and
apply software tools for big data analytics.
A.4 Theory:
Data Stream Mining is the process of extracting knowledge structures from continuous, rapid
data records.
A data stream is an ordered sequence of instances that in many applications of data stream
mining can be read only once or a small number of times using limited computing and
storage capabilities.
TYPES OF QUERIES
Ad-hoc query: You ask a query and get an immediate response. E.g. What is the
maximum value seen so far in the stream S?
Standing query: You tell the system "Any time you have an answer to this query, send me
the response"; here you do not get the answer immediately.
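The difference between the two query types can be sketched in a few lines. This is only an illustration: the class `StreamMonitor` and its method names are hypothetical and are not part of the experiment code.

```python
class StreamMonitor:
    """Keeps just enough state to serve one ad-hoc and several standing queries."""

    def __init__(self):
        self.maximum = None   # state kept for the ad-hoc query "max so far"
        self.standing = []    # (predicate, callback) pairs for standing queries

    def register(self, predicate, callback):
        """Standing query: call `callback` whenever `predicate` holds."""
        self.standing.append((predicate, callback))

    def push(self, value):
        """Process one stream element; it is seen once and not stored."""
        if self.maximum is None or value > self.maximum:
            self.maximum = value
        for predicate, callback in self.standing:
            if predicate(value):
                callback(value)

    def max_so_far(self):
        """Ad-hoc query: answered immediately from the maintained state."""
        return self.maximum


alerts = []
monitor = StreamMonitor()
monitor.register(lambda v: v > 10, alerts.append)
for value in [3, 12, 7, 15]:
    monitor.push(value)
print(monitor.max_so_far(), alerts)  # prints: 15 [12, 15]
```

The ad-hoc query is answered on request, while the standing query delivers its answers later, as matching elements arrive.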
Now suppose we have a window of length N (say N = 24) on a binary stream. We want
at all times to be able to answer a query of the form "How many 1's are there in the last K
bits?" for K <= N.
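For comparison, an exact answer to this query requires keeping the whole window in memory. A minimal sketch of that baseline is given here; the class name `ExactWindowCounter` is an assumption made for illustration.

```python
from collections import deque


class ExactWindowCounter:
    """Exact baseline: stores the last N bits explicitly (O(N) memory)."""

    def __init__(self, n):
        self.window = deque(maxlen=n)  # the oldest bit drops out automatically

    def add(self, bit):
        self.window.append(bit)

    def count_ones(self, k):
        """Number of 1's among the last k bits, for k <= N."""
        if k <= 0:
            return 0
        return sum(list(self.window)[-k:])


counter = ExactWindowCounter(24)
for bit in [1, 0, 1, 1, 0, 1]:
    counter.add(bit)
print(counter.count_ones(4))  # last four bits are 1, 1, 0, 1, so this prints 3
```

This is exact, but its memory grows linearly with N, which is the cost an approximate method tries to avoid.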
The DGIM algorithm is designed to estimate the number of 1's in such a window. It uses
O(log² N) bits to represent a window of N bits, and allows us to estimate the number of 1's
in the window with an error of no more than 50%.
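DGIM's space bound of O(log² N) bits follows from a simple count: there are at most two buckets of each power-of-two size up to N, and each bucket stores a timestamp modulo N plus the exponent of its size. A rough sketch of that arithmetic follows; the helper name `dgim_bits` is hypothetical and constant factors are not tuned.

```python
import math


def dgim_bits(n):
    """Rough upper bound on the bits DGIM needs for a window of n bits."""
    sizes = math.floor(math.log2(n)) + 1       # bucket sizes 1, 2, 4, ..., <= n
    max_buckets = 2 * sizes                    # at most two buckets per size
    timestamp_bits = math.ceil(math.log2(n))   # timestamp stored modulo n
    size_bits = math.ceil(math.log2(sizes))    # store only the exponent of the size
    return max_buckets * (timestamp_bits + size_bits)


for n in (1024, 10**6):
    print(n, dgim_bits(n))
```

Under these assumptions a window of a million bits needs on the order of a thousand bits of bucket state, instead of a million bits for the raw window.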
In the DGIM algorithm, each bit that arrives has a timestamp for the position at which it
arrives: the first bit has timestamp 1, the second bit has timestamp 2, and so on. The positions
are represented relative to the window size N, i.e. modulo N (the window size is usually taken
as a multiple of 2). The window is divided into buckets consisting of 1's and 0's, subject to the
following rules:
1. The right end of a bucket should always be a 1 (if it ends with a 0, that 0 is to be
neglected). E.g. 1001011 -> a bucket of size 4, having four 1's and a 1 on its right end.
2. Every bucket should have at least one 1, else no bucket can be formed.
3. The size of every bucket (its number of 1's) should be a power of 2.
4. The buckets cannot decrease in size as we move to the left (sizes are in increasing order
towards the left).
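The bucket rules can be checked mechanically. The sketch below assumes buckets are held as `(timestamp, size)` pairs ordered newest first, and it also checks the standard DGIM conditions that sizes are powers of two and that at most two buckets share a size. The representation and the helper name `is_valid_dgim` are assumptions made for illustration.

```python
from collections import Counter


def is_valid_dgim(buckets):
    """Check the bucket rules for a list of (timestamp, size), newest first."""
    sizes = [size for _, size in buckets]
    # every bucket holds at least one 1, and sizes are powers of two
    if any(s < 1 or (s & (s - 1)) != 0 for s in sizes):
        return False
    # sizes never decrease toward older (leftward) buckets
    if any(a > b for a, b in zip(sizes, sizes[1:])):
        return False
    # timestamps strictly decrease toward older buckets
    stamps = [t for t, _ in buckets]
    if any(a <= b for a, b in zip(stamps, stamps[1:])):
        return False
    # at most two buckets of any one size
    return all(count <= 2 for count in Counter(sizes).values())


print(is_valid_dgim([(103, 1), (100, 2), (98, 2), (92, 4)]))  # True
print(is_valid_dgim([(103, 1), (102, 1), (100, 1)]))          # False: three of size 1
```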
Estimating the number of 1’s and counting the buckets in the given data stream.
This picture shows how we can form the buckets based on the number of ones by following the
rules.
In the given data stream, let us assume that new bits arrive from the right.
When the new bit is 0: after the new bit (0) arrives with timestamp 101, there is no change
in the buckets.
When the new bit is 1, we need to make changes:
- Create a new bucket with the current timestamp and size 1.
If there is now only one bucket of size 1, nothing more needs to be done. However, if
there are now three buckets of size 1 (the buckets with timestamps 100, 102 and 103 in the
second step of the picture), we fix the problem by combining the leftmost (earliest) two
buckets of size 1 (purple box).
To combine any two adjacent buckets of the same size, replace them by one bucket of twice
the size. The timestamp of the new bucket is the timestamp of the rightmost of the two
buckets.
Now, sometimes combining two buckets of size 1 may create a third bucket of size 2. If so,
we combine the leftmost two buckets of size 2 into a bucket of size 4. This process may
ripple through the bucket sizes.
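The update rule described above can be sketched as follows. This is an illustrative version, not the code submitted in Part B; it assumes buckets are `[timestamp, size]` lists kept newest first, and the function name `dgim_add` is hypothetical.

```python
def dgim_add(buckets, bit, timestamp, n):
    """Update buckets (lists [timestamp, size], newest first) for one new bit."""
    # drop the oldest bucket once its timestamp has left the window of n bits
    while buckets and buckets[-1][0] <= timestamp - n:
        buckets.pop()
    if bit == 0:
        return buckets                      # a 0 never changes the buckets
    buckets.insert(0, [timestamp, 1])       # new bucket of size 1
    i = 0
    # while three buckets share a size, merge the two oldest of them
    while i + 2 < len(buckets) and buckets[i][1] == buckets[i + 2][1]:
        buckets[i + 1][1] *= 2              # merged bucket keeps the more recent timestamp
        del buckets[i + 2]
        i += 1                              # the merge may ripple to the next size
    return buckets


buckets = []
for t, bit in enumerate([1, 0, 1, 1, 1], start=1):
    dgim_add(buckets, bit, t, n=24)
print(buckets)  # prints: [[5, 1], [4, 1], [3, 2]]
```

Because sizes never decrease toward older buckets, comparing the first and third bucket of a run is enough to detect three buckets of equal size.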
When estimating, we keep adding buckets as long as (current timestamp - timestamp of the
leftmost bucket considered) < N (= 24 here). E.g. 103 - 87 = 16 < 24, so we continue; if the
difference is greater than or equal to N, we stop.
Counting the sizes of the buckets in the last 20 bits, we say there are 11 ones.
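The estimation step can be sketched like this, assuming buckets are `(timestamp, size)` pairs ordered newest first. The helper name `dgim_estimate` and the sample buckets are made up for illustration and are not read off the picture; rounding the half of the oldest bucket upward is also a choice made here.

```python
def dgim_estimate(buckets, current_time, k):
    """Estimate the number of 1's among the last k bits."""
    # keep only buckets whose timestamp still lies within the last k positions
    in_window = [size for ts, size in buckets if current_time - ts < k]
    if not in_window:
        return 0
    # count every bucket fully except the oldest one, which may straddle the
    # window boundary, so only half of it (rounded up) is counted
    return sum(in_window[:-1]) + (in_window[-1] + 1) // 2


sample = [(103, 1), (102, 1), (100, 2), (96, 4), (87, 8)]
print(dgim_estimate(sample, current_time=103, k=24))  # 1 + 1 + 2 + 4 + 4 = 12
print(dgim_estimate(sample, current_time=103, k=10))  # 1 + 1 + 2 + 2 = 6
```

Only the oldest bucket in range contributes uncertainty, which is where the error bound of the method comes from.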
PART B
(PART B: TO BE COMPLETED BY STUDENTS)
Roll. No.: B43 Name: Nikhil Aher
Class: Fourth Year (B) Batch: B3
Date of Experiment: 05/09/24 Date of Submission: 12/09/24
Grade:
# class Bucket stores the number of 1's as its size and the rightmost 1 as its timestamp
class Bucket:
    def __init__(self, size, time_stamp):
        self.size = size
        self.time_stamp = time_stamp
        n = n + 1
        if n == 32:
            output_file.write("\n")
            a_c_file.write("\n")
            break

merge_and_estimate(bucketList, bin_stream, bucket_counter, output_file)
actual_count(bin_stream, a_c_file)
# counts the actual number of 1's in the last 32 bits for every bit entering the window
def actual_count(bin_stream, a_c_file):
    ac = 0
    d = 0
    j = 1
    for x in range(32, len(bin_stream)):
        for y in range(j, x + 1):
            if bin_stream[y] == 1:
                ac = ac + 1
        a_c_file.write("%i " % ac)
        ac = 0
        d = d + 1
        j = j + 1
        if d == 32:  # start a new output line after every 32 counts
            a_c_file.write("\n")
            d = 0
# counts the number of 1's in the last 32 bits for every bit entering the stream, using buckets
def merge_and_estimate(bucketList, bin_stream, bucket_counter, output_file):
    z = 32  # sliding window size
    d = 0
    for x in range(32, len(bin_stream)):
        sum = 0
        if bin_stream[x] == 1:
            bucketList.append(Bucket(1, x + 1))
            bucket_counter = bucket_counter + 1
            bc = bucket_counter
            while bc != 0:
                if bc - 3 >= 0:
                    # checks if the size appears a 3rd time
                    if bucketList[bc - 3].size == bucketList[bc - 1].size:
                        size = bucketList[bc - 2].size
                        bucketList[bc - 2].size = size * 2
                        del bucketList[bc - 3]
                        bucket_counter = bucket_counter - 1
                bc = bc - 1
        b_s = bucket_counter
        while b_s > 0:  # estimate for every new bit entering the last 32-bit stream
            l = bucketList[bucket_counter - 1].time_stamp - z
            k = bucketList[b_s - 1].time_stamp
            if k > l:
                sum += bucketList[b_s - 1].size
            else:
                sum += int(bucketList[b_s - 1].size / 2)
                break
            b_s = b_s - 1
        d = d + 1
        output_file.write("%i " % sum)
        if d == 32:  # start a new output line after every 32 estimates
            output_file.write("\n")
            d = 0
text = """In the 1990’s “data mining” was an exciting and popular new concept. Around 2010,
people instead started to speak of “big data.” Today, the popular term is “data science.”
However, during all this time, the concept remained the same: use the most powerful
hardware, the most powerful programming systems, and the most efficient algorithms to
solve problems in science, commerce, healthcare, government, the humanities, and many
other fields of human endeavor. To many, data mining is the process of creating a model
from data, often by the process of machine learning, which we mention in Section 1.1.3 and
discuss more fully in Chapter 12. However, more generally, the objective of data mining is
an algorithm. For instance, we discuss locality-sensitive hashing in Chapter 3 and a number
of stream-mining algorithms in Chapter 4, none of which involve a model. Yet in many
important applications, the hard part is creating the model, and once the model is available,
the algorithm to use the model is straightforward. Consider the problem of detecting emails
that are phishing attacks. The most common approach is to build a model of phishing emails,
perhaps by examining emails that people have recently reported as"""
B.4 Conclusion:
In conclusion, the DGIM algorithm provides an effective solution for
approximate counting in data streams, offering a memory-efficient way to handle real-time
data over sliding windows. Its use of logarithmic bucketing strikes a balance between
accuracy and resource constraints, making it ideal for high-frequency applications like
network monitoring or social media analysis. However, the estimate can be off by up to 50%,
because only half of the oldest bucket is counted; allowing more buckets of each size reduces
this error at the cost of additional memory. Overall, DGIM
remains a powerful tool for scalable, real-time data processing where precision can be traded
for efficiency.