Unit I - MMD - Lecture NoteStu
(20DS7T07)
R20 :: CSE (DATA SCIENCE) :: IV–I Semester
LECTURE NOTES
hotels. Thus, the chance that they will visit the same hotel on one given day is $10^{-9}$, i.e.,
$0.0001 / 10^{5}$. The chance that they will visit the same hotel on two different given days is the
square of this number, $10^{-18}$, i.e., $10^{-9} \times 10^{-9}$. Note that the hotels can be
different on the two days.
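The two probabilities can be checked with exact fractions. This is a minimal sketch: the variable names are illustrative and do not come from the notes, and squaring the one-day probability assumes the two days are independent.

```python
from fractions import Fraction

# Values taken from the text; the variable names are illustrative.
p_numerator = Fraction(1, 10_000)  # the 0.0001 figure used in the text
num_hotels = 10**5                 # number of hotels

# Chance that the two people visit the same hotel on one given day.
p_same_hotel_one_day = p_numerator / num_hotels

# Chance that this happens on two different given days
# (assumes the days are independent; the hotels may differ by day).
p_same_hotel_two_days = p_same_hotel_one_day ** 2

print(p_same_hotel_one_day == Fraction(1, 10**9))    # True
print(p_same_hotel_two_days == Fraction(1, 10**18))  # True
```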
Now, we must consider how many events will indicate evil-doing. An “event” in this sense is
a pair of people and a pair of days, such that the two people were at the same hotel on each of the two
days.
To simplify the arithmetic, note that for large n, $\binom{n}{2}$ is about $n^2/2$. Here
$\binom{n}{r}$ stands for nCr, the number of ways to choose r items out of n; this formula is also
known as the "combinations formula". We shall use this approximation in what follows.
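The quality of approximating "n choose 2" by n squared over 2 can be checked numerically. The sketch below compares the exact count with the approximation for a few values of n; the function names are made up for illustration.

```python
import math

def pairs_exact(n: int) -> int:
    """Exact number of unordered pairs: n choose 2 = n(n-1)/2."""
    return math.comb(n, 2)

def pairs_approx(n: int) -> float:
    """Large-n approximation n^2 / 2."""
    return n * n / 2

for n in (10, 1_000, 10**9):
    exact = pairs_exact(n)
    approx = pairs_approx(n)
    rel_error = (approx - exact) / exact
    print(f"n={n}: exact={exact}, approx={approx:.3g}, relative error={rel_error:.2e}")
```

The relative error works out to 1/(n − 1), so it is noticeable for n = 10 but negligible once n is in the thousands or more.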
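The expected number of events that look like evil-doing is the product of the number of pairs of
people, the number of pairs of days, and the probability $10^{-18}$ that any one pair of people and
pair of days is such an event. With $10^{9}$ people observed over 1000 days, that number is
$5 \times 10^{17} \times 5 \times 10^{5} \times 10^{-18} = 250{,}000$.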
That is, there will be a quarter of a million pairs of people who look like evil-doers, even
though they are not.
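The figure of a quarter of a million can be reproduced with a short calculation. The sketch takes a population of 10^9 people and an observation period of 1000 days as its inputs. These are the values used in the standard version of this example, so treat them as assumptions if your setting differs. The names are illustrative.

```python
from fractions import Fraction

def approx_pairs(n: int) -> Fraction:
    """Large-n approximation to "n choose 2": n^2 / 2."""
    return Fraction(n * n, 2)

# Assumed inputs for the example.
num_people = 10**9
num_days = 1000

# Probability that a given pair of people is at the same hotel
# on each of two given days.
p_event = Fraction(1, 10**18)

pairs_of_people = approx_pairs(num_people)  # 5 * 10^17
pairs_of_days = approx_pairs(num_days)      # 5 * 10^5

expected_events = pairs_of_people * pairs_of_days * p_event
print(expected_events)  # 250000
```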
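Suppose we have a collection of N documents. Define $f_{ij}$ to be the frequency (number of
occurrences) of term (word) i in document j. Then, define the term frequency $TF_{ij}$ to be
$f_{ij}$ normalized by the maximum number of occurrences of any term in the same document:
$TF_{ij} = f_{ij} / \max_k f_{kj}$. The IDF for a term is defined as follows. Suppose term i
appears in $n_i$ of the N documents in the collection. Then $IDF_i = \log_2(N/n_i)$. The TF.IDF
score for term i in document j is then defined to be $TF_{ij} \times IDF_i$. The terms with the
highest TF.IDF score are often the terms that best characterize the topic of the document.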
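A minimal sketch of the TF.IDF computation follows. It assumes the term frequency is the raw count of a term divided by the largest count of any term in the same document, which is one common normalization. The toy corpus and the function name tf_idf are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: TF.IDF score} dict per document.

    Assumes every document contains at least one term.
    """
    n_docs = len(documents)
    # n_i: number of documents in which term i appears at least once.
    doc_count = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        freq = Counter(doc)            # f_ij for this document j
        max_freq = max(freq.values())  # largest count of any term in document j
        scores.append({
            term: (count / max_freq) * math.log2(n_docs / doc_count[term])
            for term, count in freq.items()
        })
    return scores

# Hypothetical toy corpus, already split into words.
docs = [
    "the batter hit the ball".split(),
    "the pitcher threw the ball".split(),
    "the market fell".split(),
    "the market rose".split(),
]
scores = tf_idf(docs)
print(scores[0])
```

In this toy corpus the word "the" appears in every document, so its IDF is log2(4/4) = 0 and its TF.IDF score is 0 everywhere. Words that occur in only one document, such as "batter", receive the highest scores.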
Hash Functions
The hash functions that make hash tables feasible are also essential components in a number
of data-mining algorithms, where the hash table takes an unfamiliar form.
First, a hash function h takes a hash-key value as an argument and produces a bucket number
as a result. The bucket number is an integer, normally in the range 0 to B − 1, where B is the number
of buckets. Hash-keys can be of any type.
There is an intuitive property of hash functions that they “randomize” hash-keys. To be
precise, if hash-keys are drawn randomly from a reasonable population of possible hash-keys, then h
will send approximately equal numbers of hash-keys to each of the B buckets.
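The "randomizing" behaviour can be observed empirically. The sketch below assumes integer hash-keys drawn uniformly at random and uses h(x) = x mod B as the hash function. Both choices, along with B = 10 and the sample size, are illustrative and not prescribed by the notes.

```python
import random
from collections import Counter

B = 10  # number of buckets (illustrative choice)

def h(key: int) -> int:
    """Map an integer hash-key to a bucket number in the range 0 to B - 1."""
    return key % B

rng = random.Random(0)  # fixed seed so the run is repeatable
keys = [rng.randrange(10**9) for _ in range(100_000)]

# Count how many hash-keys land in each bucket.
bucket_counts = Counter(h(k) for k in keys)
for bucket in range(B):
    print(bucket, bucket_counts[bucket])
```

With 100,000 random keys and 10 buckets, each bucket should receive close to 10,000 keys.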
It would be impossible to do so if, for example, the population of possible hash-keys were