
MINING MASSIVE DATASETS

(20DS7T07)
R20 :: CSE (DATA SCIENCE) :: IV–I Semester
LECTURE NOTES

UNIT I
Statistical Limits on Data Mining: Total Information Awareness, Bonferroni's Principle, An Example of Bonferroni's Principle.
Things Useful to Know: Importance of Words in Documents, Hash Functions, Indexes, Secondary Storage, The Base of Natural Logarithms, Power Laws.

Statistical Limits on Data Mining:


A common sort of data-mining problem involves discovering unusual events hidden within
massive amounts of data.
Total Information Awareness:
Following the terrorist attack of Sept. 11, 2001, it was noticed that there were four people
enrolled in different flight schools, learning how to pilot commercial aircraft, although they were not
affiliated with any airline. It was conjectured that the information needed to predict and foil the
attack was available in data, but that there was then no way to examine the data and detect
suspicious events.
The response was a program called TIA, or Total Information Awareness, which was intended
to mine all the data it could find, including credit-card receipts, hotel records, travel data, and many
other kinds of information in order to track terrorist activity.
Now information integration – the idea of relating and combining different data sources to
obtain insights that are not available from any one source – is often a key step on the way to solving
an important problem.
TIA naturally caused great concern among privacy advocates, and the project was eventually
killed by Congress.
However, the prospect of TIA or a system like it does raise many technical questions about its
feasibility.
One particular technical problem: if you look in your data for too many things at the same
time, you will see things that look interesting, but are in fact simply statistical artifacts and have
no significance.
Example:- If you search your data for activities that look like terrorist behavior, are you not
going to find many innocent activities – or even illicit activities that are not terrorism – that will
result in visits from the police and maybe worse than just a visit? The answer is that it all depends on
how narrowly you define the activities that you look for.
Statisticians have seen this problem in many guises and have a theory, called Bonferroni's
Principle.
Bonferroni's Principle
Suppose you have a certain amount of data, and you look for events of a certain type within
that data. You can expect events of this type to occur, even if the data is completely random, and the
number of occurrences of these events will grow as the size of the data grows.
These occurrences are “bogus,” in the sense that they have no cause other than that random
data will always have some number of unusual features that look significant but aren't.
A theorem of statistics, known as the Bonferroni correction gives a statistically sound way to
avoid most of these bogus positive responses to a search through the data.
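In its classical form, the Bonferroni correction says: if you carry out m separate tests and want the overall chance of even one false positive to stay below a significance level alpha, require each individual test to be significant at level alpha/m. Below is a minimal Python sketch of this idea; the values of alpha, m, and the p-values are made-up illustrative numbers, not from these notes.

alpha = 0.05                      # desired family-wise significance level (assumed)
m = 1000                          # number of hypotheses tested (assumed)
per_test_threshold = alpha / m    # Bonferroni-corrected threshold for each test

p_values = [0.00001, 0.0002, 0.03]                          # illustrative p-values
significant = [p for p in p_values if p < per_test_threshold]
print(per_test_threshold, significant)                      # prints 5e-05 [1e-05]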
Bonferroni's principle helps us avoid treating random occurrences as if they were real.
Calculate the expected number of occurrences of the events you are looking for, on the
assumption that data is random. If this number is significantly larger than the number of real
instances you hope to find, then you must expect almost anything you find to be bogus, i.e., a
statistical artifact rather than evidence of what you are looking for.
This observation is the informal statement of Bonferroni's principle.
An Example of Bonferroni's Principle
Suppose there are believed to be some “evil-doers” out there, and we want to detect them.
Suppose further that we have reason to believe that periodically, evil-doers gather at a hotel to plot
their evil.
Let us make the following assumptions about the size of the problem:
1. There are one billion people who might be evil-doers.
2. Everyone goes to a hotel one day in 100.
3. A hotel holds 100 people. Hence, there are 100,000 hotels – enough to hold the 1% of a
billion people who visit a hotel on any given day.
4. We shall examine hotel records for 1000 days.
To find evil-doers in this data, we shall look for people who, on two different days, were both
at the same hotel. Suppose, however, that there really are no evil-doers. That is, everyone behaves at



random, deciding with probability 0.01 to visit a hotel on any given day, and if so, choosing one of
the 10^5 hotels at random.
Would we find any pairs of people who appear to be evil-doers?
We can do a simple approximate calculation as follows.
The probability that any given person visits a hotel on any given day is 0.01. Therefore, the
probability that any two given people both decide to visit a hotel on the same given day is
0.01 * 0.01 = 0.0001.
The chance that they will visit the same hotel is this probability divided by 10^5, the number of
hotels. Thus, the chance that they will visit the same hotel on one given day is 0.0001 / 10^5 = 10^-9.
The chance that they will visit the same hotel on two different given days is the square of this
number, 10^-9 * 10^-9 = 10^-18. Note that the hotels can be different on the two days.

Now, we must consider how many events will indicate evil-doing. An “event” in this sense is
a pair of people and a pair of days, such that the two people were at the same hotel on each of the two
days.

To simplify the arithmetic, note that for large n, C(n, 2), the number of ways to choose 2 items
out of n (the "combinations formula" nCr with r = 2), is about n^2/2. We shall use this approximation
in what follows.

Thus, the number of pairs of people is C(10^9, 2), which is approximately (10^9)^2 / 2 = 5 * 10^17.

The number of pairs of days is C(1000, 2), which is approximately 1000^2 / 2 = 5 * 10^5.

The expected number of events that look like evil-doing is the product of the number of pairs
of people, the number of pairs of days, and the probability that any one pair of people and pair of
days is an instance of the behavior we are looking for.

That number is 5 * 10^17 * 5 * 10^5 * 10^-18 = 250,000.
That is, there will be a quarter of a million pairs of people who look like evil-doers, even



though they are not.
Now, suppose there really are 10 pairs of evil-doers out there. To find them, the police would have
to investigate about a quarter of a million candidate pairs. The intrusion on the lives of the roughly
half a million innocent people involved, and the amount of work required, are sufficiently great that
this approach to finding evil-doers is probably not feasible.
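As a quick check of this arithmetic, the following Python sketch redoes the expected-count calculation under the assumptions listed above (a billion people, 10^5 hotels, 1000 days of records, visit probability 0.01); the variable names are illustrative only.

from math import comb

people = 10**9        # potential evil-doers
hotels = 10**5        # hotels, each holding 100 people
days = 1000           # days of hotel records examined
p_visit = 0.01        # probability a person visits a hotel on a given day

# Probability that two given people are at the same hotel on one given day
p_same_hotel_one_day = (p_visit * p_visit) / hotels       # about 10^-9

# Probability they are at the same hotel on two given days (hotels may differ)
p_same_hotel_two_days = p_same_hotel_one_day ** 2         # about 10^-18

# Expected number of "suspicious" (person-pair, day-pair) events
expected_events = comb(people, 2) * comb(days, 2) * p_same_hotel_two_days
print(round(expected_events))   # about 249,750, i.e., roughly a quarter of a million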



Things Useful to Know:
Importance of Words in Documents (TF.IDF measure of word importance)
In several applications of data mining, we shall be faced with the problem of categorizing
documents (sequences of words) by their topic. Typically, topics are identified by finding the special
words that characterize documents about that topic.
For instance, articles about baseball would tend to have many occurrences of words like
“ball,” “bat,” “pitch,” “run,” and so on. Until we have made the classification, it is not possible to
identify these words as characteristic.
Thus, classification often starts by looking at documents, and finding the significant words in
those documents. Our first guess might be that the words appearing most frequently in a document
are the most significant. However, that intuition is exactly opposite of the truth.
The most frequent words will most surely be the common words such as “the” or “and,”
which help build ideas but do not carry any significance themselves. In fact, the several hundred
most common words in English (called stop words) are often removed from documents before any
attempt to classify them.
In fact, the indicators of the topic are relatively rare words. However, not all rare words are
equally useful as indicators. There are certain words, for example “notwithstanding” or “albeit,” that
appear rarely in a collection of documents, yet do not tell us anything useful.
On the other hand, a word like “chukker” is probably equally rare, but tips us off that the
document is about the sport of polo. The difference between rare words that tell us something and
those that do not has to do with the concentration of the useful words in just a few documents.
The formal measure of how concentrated into relatively few documents are the occurrences of
a given word is called TF.IDF (Term Frequency times Inverse Document Frequency).

It is normally computed as follows. Suppose we have a collection of N documents. Define fij
to be the frequency (number of occurrences) of term (word) i in document j. Then, define the term
frequency TFij to be:

TFij = fij / maxk fkj

That is, the term frequency of term i in document j is fij normalized by dividing it by the maximum
number of occurrences of any term in the same document. Thus, the most frequent term in document
j gets a TF of 1, and other terms get fractions as their term frequencies.



The IDF for a term is defined as follows. Suppose term i appears in ni of the N documents in

the collection. Then IDFi = log2(N/ni). The TF.IDF score for term i in document j is then defined to

be TFij × IDFi. The terms with the highest TF.IDF score are often the terms that best characterize
the topic of the document.
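As an illustration, here is a minimal Python sketch of this computation on a toy collection of documents; the tokenization (lowercasing and splitting on whitespace) and the sample documents are assumptions made for the example, not part of the definition.

from math import log2
from collections import Counter

# A toy collection of N documents (assumed sample data)
docs = [
    "the batter hit the ball with the bat",
    "the pitcher threw a fast pitch",
    "the chukker ended and the polo ponies rested",
]
N = len(docs)

# f[j] maps each term i to its frequency fij in document j
f = [Counter(doc.lower().split()) for doc in docs]

def tf(i, j):
    # Term frequency, normalized by the most frequent term in document j
    return f[j][i] / max(f[j].values())

def idf(i):
    # n_i = number of documents in which term i appears
    n_i = sum(1 for counts in f if i in counts)
    return log2(N / n_i)

def tf_idf(i, j):
    return tf(i, j) * idf(i)

# "chukker" appears in only one document, so it scores highly there;
# "the" appears in every document, so its IDF (and hence its score) is 0.
print(tf_idf("chukker", 2), tf_idf("the", 2))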

Hash Functions
The hash functions that make hash tables feasible are also essential components in a number
of data-mining algorithms, where the hash table takes an unfamiliar form.
First, a hash function h takes a hash-key value as an argument and produces a bucket number
as a result. The bucket number is an integer, normally in the range 0 to B − 1, where B is the number
of buckets. Hash-keys can be of any type.
There is an intuitive property of hash functions that they “randomize” hash-keys. To be
precise, if hash-keys are drawn randomly from a reasonable population of possible hash-keys, then h
will send approximately equal numbers of hash-keys to each of the B buckets.
It would be impossible to do so if, for example, the population of possible hash-keys were



smaller than B. Such a population would not be “reasonable.”
Suppose hash-keys are positive integers. A common and simple hash function is to pick h(x) = x mod
B, that is, the remainder when x is divided by B. That choice works well if our population of hash-
keys is all positive integers. 1/Bth of the integers will be assigned to each of the buckets.
However, suppose our population is the even integers, and B = 10. Then only buckets 0, 2, 4, 6, and 8
can be the value of h(x), and the hash function is distinctly nonrandom in its behavior. On the other
hand, if we picked B = 11, then we would find that 1/11th of the even integers get sent to each of the
11 buckets, so the hash function would work well in this case.
Thus, it is normally preferred that we choose B to be a prime. That choice reduces the chance of
nonrandom behavior, although we still have to consider the possibility that all hash-keys have B as a
factor.
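A minimal Python sketch of this behavior, assuming integer hash-keys and the simple h(x) = x mod B function described above (the bucket-counting code is just for illustration):

from collections import Counter

def h(x, B):
    # Simple hash function for integer hash-keys: the remainder mod B
    return x % B

def bucket_counts(keys, B):
    # Count how many hash-keys land in each of the B buckets
    return Counter(h(x, B) for x in keys)

even_integers = range(0, 2000, 2)

# B = 10: only buckets 0, 2, 4, 6, and 8 are ever used
print(bucket_counts(even_integers, 10))

# B = 11 (a prime): the even integers spread roughly evenly over all 11 buckets
print(bucket_counts(even_integers, 11))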
Indexes
An index is a data structure that makes it efficient to retrieve objects given the value of one or
more elements of those objects. The most common situation is one where the objects are records,
and the index is on one of the fields of that record. Given a value v for that field, the index lets us
retrieve all the records with value v in that field, without having to retrieve all the records in the file.
For example, we could have a file of (name, address, phone) triples, and an index on the
phone field. Given a phone number, the index allows us to find quickly the record or records with
that phone number.
A hash table is one simple way to build an index. The field or fields on which the index is based
form the hash-key for a hash function. The hash function is applied to the value of the hash-key
for each record, and the record itself is placed in the bucket whose number is determined by the hash
function. The bucket could be a list of records in main-memory, or a disk block, for example.
Then, given a hash-key value, we can hash it, find the bucket, and need to search only that
bucket to find the records with that value for the hash-key.
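Here is a minimal sketch of such a hash-based index in Python, using the (name, address, phone) example above; the bucket count B and the sample records are assumptions made for illustration.

B = 11  # number of buckets (illustrative choice)

def h(phone):
    # Hash the phone field (the hash-key) to a bucket number in 0..B-1
    return hash(phone) % B

records = [
    ("Alice", "12 Oak St", "555-0101"),
    ("Bob", "34 Elm St", "555-0102"),
    ("Carol", "56 Pine St", "555-0101"),  # same phone number as Alice
]

# Build the index: each bucket holds a list of records
buckets = [[] for _ in range(B)]
for record in records:
    name, address, phone = record
    buckets[h(phone)].append(record)

def lookup(phone):
    # Hash the phone number, then search only that one bucket
    return [r for r in buckets[h(phone)] if r[2] == phone]

print(lookup("555-0101"))   # both records with this phone number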
Secondary Storage
It is important, when dealing with large-scale data, that we have a good understanding of the
difference in time taken to perform computations when the data is initially on disk, as opposed to the
time needed if the data is initially in main memory.
Disks are organized into blocks, which are the minimum units that the operating system uses
to move data between main memory and disk. For example, the Windows operating system uses



blocks of 64K bytes (i.e., 2^16 = 65,536 bytes to be exact). It takes approximately ten milliseconds to
access (move the disk head to the track of the block and wait for the block to rotate under the head)
and read a disk block. That delay is at least five orders of magnitude (a factor of 10^5) slower than the
time taken to read a word from main memory, so if all we want to do is access a few bytes, there is
an overwhelming benefit to having data in main memory.
In fact, if we want to do something simple to every byte of a disk block, e.g., treat the block as
a bucket of a hash table and search for a particular value of the hash-key among all the records in
that bucket, then the time taken to move the block from disk to main memory will be far larger than
the time taken to do the computation.
By organizing our data so that related data is on a single cylinder (the collection of blocks
reachable at a fixed radius from the center of the disk, and therefore accessible without moving the
disk head), we can read all the blocks on the cylinder into main memory in considerably less than 10
milliseconds per block. You can assume that a disk cannot transfer data to main memory at more
than a hundred million bytes per second, no matter how that data is organized. That is not a problem
when your dataset is a megabyte. But a dataset of a hundred gigabytes or a terabyte presents
problems just accessing it, let alone doing anything useful with it.
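A back-of-the-envelope sketch of that last point in Python, using the figures assumed above (a transfer rate of about a hundred million bytes per second, 64K-byte blocks, and roughly 10 milliseconds per random block access):

TRANSFER_RATE = 100 * 10**6   # bytes per second (assumed upper bound from the text)
BLOCK_SIZE = 64 * 1024        # bytes per block
SEEK_TIME = 0.010             # seconds per random block access

def sequential_read_seconds(n_bytes):
    # Best case: data laid out contiguously (e.g., on one cylinder)
    return n_bytes / TRANSFER_RATE

def random_read_seconds(n_bytes):
    # Worst case: every block needs its own seek plus its transfer time
    n_blocks = n_bytes / BLOCK_SIZE
    return n_blocks * (SEEK_TIME + BLOCK_SIZE / TRANSFER_RATE)

terabyte = 10**12
print(sequential_read_seconds(terabyte) / 3600)   # roughly 2.8 hours
print(random_read_seconds(terabyte) / 3600)       # roughly 45 hours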

The Base of Natural Logarithms



Power Laws



