
MINING MASSIVE DATASETS

(20DS7T07)
R20 :: CSE (DATA SCIENCE) :: IV–I Semester
LECTURE NOTES

UNIT I
Statistical Limits on Data Mining: Total Information Awareness, Bonferroni's Principle, An Example of Bonferroni's Principle.
Things Useful to Know: Importance of Words in Documents, Hash Functions, Indexes, Secondary Storage, The Base of Natural Logarithms, Power Laws.

Statistical Limits on Data Mining:


A common sort of data-mining problem involves discovering unusual events hidden within
massive amounts of data.
Total Information Awareness:
Following the terrorist attack of Sept. 11, 2001, it was noticed that there were four people
enrolled in different flight schools, learning how to pilot commercial aircraft, although they were not
affiliated with any airline. It was conjectured that the information needed to predict and foil the
attack was available in data, but that there was then no way to examine the data and detect
suspicious events.
The response was a program called TIA, or Total Information Awareness, which was intended
to mine all the data it could find, including credit-card receipts, hotel records, travel data, and many
other kinds of information in order to track terrorist activity.
Now information integration – the idea of relating and combining different data sources to
obtain insights that are not available from any one source – is often a key step on the way to solving
an important problem.
TIA naturally caused great concern among privacy advocates, and the project was eventually
killed by Congress.
However, the prospect of TIA or a system like it does raise many technical questions about its
feasibility.
One particular technical problem: if you look in your data for too many things at the same
time, you will see things that look interesting, but are in fact simply statistical artifacts and have
no significance.
Example:- If you search your data for activities that look like terrorist behavior, are you not
going to find many innocent activities – or even illicit activities that are not terrorism – that will
result in visits from the police and maybe worse than just a visit? The answer is that it all depends on
how narrowly you define the activities that you look for.
Statisticians have seen this problem in many guises and have a theory, called Bonferroni's
Principle.
Bonferroni's Principle
Suppose you have a certain amount of data, and you look for events of a certain type within
that data. You can expect events of this type to occur, even if the data is completely random, and the
number of occurrences of these events will grow as the size of the data grows.
These occurrences are “bogus,” in the sense that they have no cause other than that random
data will always have some number of unusual features that look significant but aren't.
A theorem of statistics, known as the Bonferroni correction gives a statistically sound way to
avoid most of these bogus positive responses to a search through the data.
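In its classical form, the Bonferroni correction says: if you carry out m separate tests and want the overall chance of even one false positive to stay below a significance level alpha, require each individual test to be significant at level alpha/m. Below is a minimal Python sketch of this idea; the values of alpha, m, and the p-values are made-up illustrative numbers, not from these notes.

alpha = 0.05                      # desired family-wise significance level (assumed)
m = 1000                          # number of hypotheses tested (assumed)
per_test_threshold = alpha / m    # Bonferroni-corrected threshold for each test

p_values = [0.00001, 0.0002, 0.03]                          # illustrative p-values
significant = [p for p in p_values if p < per_test_threshold]
print(per_test_threshold, significant)                      # prints 5e-05 [1e-05]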
Bonferroni's principle helps us avoid treating random occurrences as if they were real.
Calculate the expected number of occurrences of the events you are looking for, on the
assumption that data is random. If this number is significantly larger than the number of real
instances you hope to find, then you must expect almost anything you find to be bogus, i.e., a
statistical artifact rather than evidence of what you are looking for.
This observation is the informal statement of Bonferroni's principle.
An Example of Bonferroni's Principle
Suppose there are believed to be some “evil-doers” out there, and we want to detect them.
Suppose further that we have reason to believe that periodically, evil-doers gather at a hotel to plot
their evil.
Let us make the following assumptions about the size of the problem:
1. There are one billion people who might be evil-doers.
2. Everyone goes to a hotel one day in 100.
3. A hotel holds 100 people. Hence, there are 100,000 hotels – enough to hold the 1% of a
billion people who visit a hotel on any given day.
4. We shall examine hotel records for 1000 days.
To find evil-doers in this data, we shall look for people who, on two different days, were both
at the same hotel. Suppose, however, that there really are no evil-doers. That is, everyone behaves at



random, deciding with probability 0.01 to visit a hotel on any given day, and if so, choosing one of
the 10^5 hotels at random.
Would we find any pairs of people who appear to be evil-doers?
We can do a simple approximate calculation as follows.
The probability that any given person visits a hotel on any given day is 0.01. Therefore, the
probability that any two given people both decide to visit a hotel on the same given day is
0.01 * 0.01 = 0.0001.
The chance that they will visit the same hotel is this probability divided by 10^5, the number of
hotels. Thus, the chance that they will visit the same hotel on one given day is 0.0001 / 10^5 = 10^-9.
The chance that they will visit the same hotel on two different given days is the square of this
number, 10^-9 * 10^-9 = 10^-18. Note that the hotels can be different on the two days.

Now, we must consider how many events will indicate evil-doing. An “event” in this sense is
a pair of people and a pair of days, such that the two people were at the same hotel on each of the two
days.

To simplify the arithmetic, note that for large n, C(n, 2), the number of ways to choose 2 items
out of n (the "combinations formula" nCr with r = 2), is about n^2/2. We shall use this approximation
in what follows.

Thus, the number of pairs of people is C(10^9, 2), which is approximately (10^9)^2 / 2 = 5 * 10^17.

The number of pairs of days is C(1000, 2), which is approximately 1000^2 / 2 = 5 * 10^5.

The expected number of events that look like evil-doing is the product of the number of pairs
of people, the number of pairs of days, and the probability that any one pair of people and pair of
days is an instance of the behavior we are looking for.

That number is 5 * 10^17 * 5 * 10^5 * 10^-18 = 250,000.
That is, there will be a quarter of a million pairs of people who look like evil-doers, even



though they are not.
Now, suppose there really are 10 pairs of evil-doers out there. To find them, the police would have
to investigate about a quarter of a million candidate pairs. The intrusion on the lives of the roughly
half a million innocent people involved, and the amount of work required, are sufficiently great that
this approach to finding evil-doers is probably not feasible.
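As a quick check of this arithmetic, the following Python sketch redoes the expected-count calculation under the assumptions listed above (a billion people, 10^5 hotels, 1000 days of records, visit probability 0.01); the variable names are illustrative only.

from math import comb

people = 10**9        # potential evil-doers
hotels = 10**5        # hotels, each holding 100 people
days = 1000           # days of hotel records examined
p_visit = 0.01        # probability a person visits a hotel on a given day

# Probability that two given people are at the same hotel on one given day
p_same_hotel_one_day = (p_visit * p_visit) / hotels       # about 10^-9

# Probability they are at the same hotel on two given days (hotels may differ)
p_same_hotel_two_days = p_same_hotel_one_day ** 2         # about 10^-18

# Expected number of "suspicious" (person-pair, day-pair) events
expected_events = comb(people, 2) * comb(days, 2) * p_same_hotel_two_days
print(round(expected_events))   # about 249,750, i.e., roughly a quarter of a million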



Things Useful to Know:
Importance of Words in Documents (TF.IDF measure of word importance)
In several applications of data mining, we shall be faced with the problem of categorizing
documents (sequences of words) by their topic. Typically, topics are identified by finding the special
words that characterize documents about that topic.
For instance, articles about baseball would tend to have many occurrences of words like
“ball,” “bat,” “pitch,” “run,” and so on. Until we have made the classification, it is not possible to
identify these words as characteristic.
Thus, classification often starts by looking at documents, and finding the significant words in
those documents. Our first guess might be that the words appearing most frequently in a document
are the most significant. However, that intuition is exactly opposite of the truth.
The most frequent words will most surely be the common words such as “the” or “and,”
which help build ideas but do not carry any significance themselves. In fact, the several hundred
most common words in English (called stop words) are often removed from documents before any
attempt to classify them.
In fact, the indicators of the topic are relatively rare words. However, not all rare words are
equally useful as indicators. There are certain words, for example “notwithstanding” or “albeit,” that
appear rarely in a collection of documents, yet do not tell us anything useful.
On the other hand, a word like “chukker” is probably equally rare, but tips us off that the
document is about the sport of polo. The difference between rare words that tell us something and
those that do not has to do with the concentration of the useful words in just a few documents.
The formal measure of how concentrated into relatively few documents are the occurrences of
a given word is called TF.IDF (Term Frequency times Inverse Document Frequency).

It is normally computed as follows. Suppose we have a collection of N documents. Define fij
to be the frequency (number of occurrences) of term (word) i in document j. Then, define the term
frequency TFij to be:

TFij = fij / maxk fkj

That is, the term frequency of term i in document j is fij normalized by dividing it by the maximum
number of occurrences of any term in the same document. Thus, the most frequent term in document
j gets a TF of 1, and other terms get fractions as their term frequencies.



The IDF for a term is defined as follows. Suppose term i appears in ni of the N documents in

the collection. Then IDFi = log2(N/ni). The TF.IDF score for term i in document j is then defined to

be TFij × IDFi. The terms with the highest TF.IDF score are often the terms that best characterize
the topic of the document.
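As an illustration, here is a minimal Python sketch of this computation on a toy collection of documents; the tokenization (lowercasing and splitting on whitespace) and the sample documents are assumptions made for the example, not part of the definition.

from math import log2
from collections import Counter

# A toy collection of N documents (assumed sample data)
docs = [
    "the batter hit the ball with the bat",
    "the pitcher threw a fast pitch",
    "the chukker ended and the polo ponies rested",
]
N = len(docs)

# f[j] maps each term i to its frequency fij in document j
f = [Counter(doc.lower().split()) for doc in docs]

def tf(i, j):
    # Term frequency, normalized by the most frequent term in document j
    return f[j][i] / max(f[j].values())

def idf(i):
    # n_i = number of documents in which term i appears
    n_i = sum(1 for counts in f if i in counts)
    return log2(N / n_i)

def tf_idf(i, j):
    return tf(i, j) * idf(i)

# "chukker" appears in only one document, so it scores highly there;
# "the" appears in every document, so its IDF (and hence its score) is 0.
print(tf_idf("chukker", 2), tf_idf("the", 2))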

Hash Functions
The hash functions that make hash tables feasible are also essential components in a number
of data-mining algorithms, where the hash table takes an unfamiliar form.
First, a hash function h takes a hash-key value as an argument and produces a bucket number
as a result. The bucket number is an integer, normally in the range 0 to B − 1, where B is the number
of buckets. Hash-keys can be of any type.
There is an intuitive property of hash functions that they “randomize” hash-keys. To be
precise, if hash-keys are drawn randomly from a reasonable population of possible hash-keys, then h
will send approximately equal numbers of hash-keys to each of the B buckets.
It would be impossible to do so if, for example, the population of possible hash-keys were



smaller than B. Such a population would not be “reasonable.”
Suppose hash-keys are positive integers. A common and simple hash function is to pick h(x) = x mod
B, that is, the remainder when x is divided by B. That choice works well if our population of hash-
keys is all positive integers. 1/Bth of the integers will be assigned to each of the buckets.
However, suppose our population is the even integers, and B = 10. Then only buckets 0, 2, 4, 6, and 8
can be the value of h(x), and the hash function is distinctly nonrandom in its behavior. On the other
hand, if we picked B = 11, then we would find that 1/11th of the even integers get sent to each of the
11 buckets, so the hash function would work well in this case.
Thus, it is normally preferred that we choose B to be a prime. That choice reduces the chance of
nonrandom behavior, although we still have to consider the possibility that all hash-keys have B as a
factor.
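A minimal Python sketch of this behavior, assuming integer hash-keys and the simple h(x) = x mod B function described above (the bucket-counting code is just for illustration):

from collections import Counter

def h(x, B):
    # Simple hash function for integer hash-keys: the remainder mod B
    return x % B

def bucket_counts(keys, B):
    # Count how many hash-keys land in each of the B buckets
    return Counter(h(x, B) for x in keys)

even_integers = range(0, 2000, 2)

# B = 10: only buckets 0, 2, 4, 6, and 8 are ever used
print(bucket_counts(even_integers, 10))

# B = 11 (a prime): the even integers spread roughly evenly over all 11 buckets
print(bucket_counts(even_integers, 11))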
Indexes
An index is a data structure that makes it efficient to retrieve objects given the value of one or
more elements of those objects. The most common situation is one where the objects are records,
and the index is on one of the fields of that record. Given a value v for that field, the index lets us
retrieve all the records with value v in that field, without having to retrieve all the records in the file.
For example, we could have a file of (name, address, phone) triples, and an index on the
phone field. Given a phone number, the index allows us to find quickly the record or records with
that phone number.
A hash table is one simple way to build an index. The field or fields on which the index is based
form the hash-key for a hash function. The hash function is applied to the value of the hash-key
for each record, and the record itself is placed in the bucket whose number is determined by the hash
function. The bucket could be a list of records in main-memory, or a disk block, for example.
Then, given a hash-key value, we can hash it, find the bucket, and need to search only that
bucket to find the records with that value for the hash-key.
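Here is a minimal sketch of such a hash-based index in Python, using the (name, address, phone) example above; the bucket count B and the sample records are assumptions made for illustration.

B = 11  # number of buckets (illustrative choice)

def h(phone):
    # Hash the phone field (the hash-key) to a bucket number in 0..B-1
    return hash(phone) % B

records = [
    ("Alice", "12 Oak St", "555-0101"),
    ("Bob", "34 Elm St", "555-0102"),
    ("Carol", "56 Pine St", "555-0101"),  # same phone number as Alice
]

# Build the index: each bucket holds a list of records
buckets = [[] for _ in range(B)]
for record in records:
    name, address, phone = record
    buckets[h(phone)].append(record)

def lookup(phone):
    # Hash the phone number, then search only that one bucket
    return [r for r in buckets[h(phone)] if r[2] == phone]

print(lookup("555-0101"))   # both records with this phone number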
Secondary Storage
It is important, when dealing with large-scale data, that we have a good understanding of the
difference in time taken to perform computations when the data is initially on disk, as opposed to the
time needed if the data is initially in main memory.
Disks are organized into blocks, which are the minimum units that the operating system uses
to move data between main memory and disk. For example, the Windows operating system uses



blocks of 64K bytes (i.e., 2^16 = 65,536 bytes to be exact). It takes approximately ten milliseconds to
access (move the disk head to the track of the block and wait for the block to rotate under the head)
and read a disk block. That delay is at least five orders of magnitude (a factor of 10^5) slower than the
time taken to read a word from main memory, so if all we want to do is access a few bytes, there is
an overwhelming benefit to having data in main memory.
In fact, if we want to do something simple to every byte of a disk block, e.g., treat the block as
a bucket of a hash table and search for a particular value of the hash-key among all the records in
that bucket, then the time taken to move the block from disk to main memory will be far larger than
the time taken to do the computation.
By organizing our data so that related data is on a single cylinder (the collection of blocks
reachable at a fixed radius from the center of the disk, and therefore accessible without moving the
disk head), we can read all the blocks on the cylinder into main memory in considerably less than 10
milliseconds per block. You can assume that a disk cannot transfer data to main memory at more
than a hundred million bytes per second, no matter how that data is organized. That is not a problem
when your dataset is a megabyte. But a dataset of a hundred gigabytes or a terabyte presents
problems just accessing it, let alone doing anything useful with it.
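A back-of-the-envelope sketch of that last point in Python, using the figures assumed above (a transfer rate of about a hundred million bytes per second, 64K-byte blocks, and roughly 10 milliseconds per random block access):

TRANSFER_RATE = 100 * 10**6   # bytes per second (assumed upper bound from the text)
BLOCK_SIZE = 64 * 1024        # bytes per block
SEEK_TIME = 0.010             # seconds per random block access

def sequential_read_seconds(n_bytes):
    # Best case: data laid out contiguously (e.g., on one cylinder)
    return n_bytes / TRANSFER_RATE

def random_read_seconds(n_bytes):
    # Worst case: every block needs its own seek plus its transfer time
    n_blocks = n_bytes / BLOCK_SIZE
    return n_blocks * (SEEK_TIME + BLOCK_SIZE / TRANSFER_RATE)

terabyte = 10**12
print(sequential_read_seconds(terabyte) / 3600)   # roughly 2.8 hours
print(random_read_seconds(terabyte) / 3600)       # roughly 45 hours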

The Base of Natural Logarithms



Power Laws



