Data Mining Formula
Proximity measures quantify the similarity or dissimilarity between data points or observations. These measures are often used in clustering, classification, and dimensionality reduction tasks. Some common proximity measures include:
1. Euclidean Distance:
Example: Calculate the Euclidean distance between points A(2, 3) and B(5, 7).
2. Manhattan Distance:
Example: Calculate the Manhattan distance between points A(2, 3) and B(5, 7).
3. Minkowski Distance:
Example: Calculate the Minkowski distance with p=3 between points A(2, 3) and B(5, 7).
4. Cosine Similarity:
Example: Calculate the cosine similarity between vectors A(2, 3) and B(5, 7).
Calculation: (2*5 + 3*7) / (√(2^2 + 3^2) * √(5^2 + 7^2)) = (10 + 21) / (√13 * √74) = 31 / √962 ≈ 0.9995
5. Jaccard Similarity:
Example: Calculate the Jaccard similarity between sets A = {1, 2, 3} and B = {2, 3, 4}.
Calculation: (2 common elements {2, 3}) / (4 total distinct elements {1, 2, 3, 4}) = 2/4 = 0.5
These are some common proximity measures used in statistics, but there are many others
depending on the specific data and context of the analysis. These measures help quantify the
relationships between data points, which is essential in various statistical and machine learning
applications.
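A minimal Python sketch of the first four measures, using the points from the worked examples above; the function names are illustrative, not taken from any particular library:

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    # Generalizes Euclidean (p=2) and Manhattan (p=1)
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

A, B = (2, 3), (5, 7)
print(euclidean(A, B))          # 5.0
print(manhattan(A, B))          # 7
print(minkowski(A, B, 3))       # ~4.498
print(cosine_similarity(A, B))  # ~0.9995
```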
1. Euclidean Distance: Calculate the Euclidean distance between the following pairs of points:
2. Manhattan Distance: Compute the Manhattan distance for the following pairs of points:
3. Minkowski Distance: Calculate the Minkowski distance with p=2 (Euclidean distance) for the
following pairs of points:
4. Cosine Similarity: Find the cosine similarity between the following pairs of vectors:
5. Jaccard Similarity: Calculate the Jaccard similarity for the following pairs of sets (consider
binary data):
Question 1: Euclidean Distance
Calculate the Euclidean distance between Point A (2, 3) and Point B (5, 7).
Answer 1:
Euclidean Distance = √((x2 - x1)^2 + (y2 - y1)^2)
= √((5 - 2)^2 + (7 - 3)^2)
= √(3^2 + 4^2)
= √(9 + 16)
= √25
= 5
Question 2: Manhattan Distance
Compute the Manhattan distance between Point X (3, 2) and Point Y (6, 5).
Answer 2:
Manhattan Distance = |x2 - x1| + |y2 - y1|
= |6 - 3| + |5 - 2|
= 3 + 3
= 6
Question 3: Cosine Similarity
Find the cosine similarity between Vector U (1, 2) and Vector V (3, 4).
Answer 3:
Cosine Similarity = (U · V) / (||U|| * ||V||)
= (1*3 + 2*4) / (√(1^2 + 2^2) * √(3^2 + 4^2))
= (3 + 8) / (√5 * √25)
= 11 / (√5 * 5)
≈ 11 / 11.18
≈ 0.9839
These numerical questions and answers demonstrate calculations using proximity measures in
statistics.
Question 1: Minkowski Distance
Calculate the Minkowski distance with p=3 between points A(1, 2, 3) and B(4, 5, 6).
Answer 1:
Minkowski Distance = (∑|xi - yi|^p)^(1/p)
= (|1 - 4|^3 + |2 - 5|^3 + |3 - 6|^3)^(1/3)
= (3^3 + 3^3 + 3^3)^(1/3)
= (27 + 27 + 27)^(1/3)
= 81^(1/3)
≈ 4.3267
Question 2: Cosine Similarity in High Dimensions
Calculate the cosine similarity between two high-dimensional vectors, U = [1, 0, 0, 0, ...] and V = [0, 1, 0, 0, ...] (1000 dimensions each), where U has a single 1 in the first position and V has a single 1 in the second position, with zeros everywhere else.
Answer 2:
Cosine Similarity = (U · V) / (||U|| * ||V||)
= (1*0 + 0*1 + 0*0 + ... + 0*0) / (√(1^2 + 0^2 + ... + 0^2) * √(0^2 + 1^2 + ... + 0^2))
= 0 / (√1 * √1)
= 0
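A quick NumPy check of this result; the dimension and positions of the 1s follow the question above:

```python
import numpy as np

U = np.zeros(1000); U[0] = 1.0  # 1 in the first position
V = np.zeros(1000); V[1] = 1.0  # 1 in the second position

# Dot product over the product of norms
cos_sim = U @ V / (np.linalg.norm(U) * np.linalg.norm(V))
print(cos_sim)  # 0.0 -- the vectors are orthogonal
```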
Question 3: Jaccard Similarity with Sets
Calculate the Jaccard similarity for the following pairs of sets (consider binary data):
Answer 3: Jaccard Similarity = |A ∩ B| / |A ∪ B| (number of common elements over total number of distinct elements)
For Set A and Set B: Jaccard Similarity = (3 common elements {3, 4, 5}) / (7 total distinct elements {1, 2, 3, 4, 5, 6, 7}) = 3/7
For Set X and Set Y: Jaccard Similarity = (3 common elements {c, d, e}) / (7 total distinct elements {a, b, c, d, e, f, g}) = 3/7
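A one-function sketch of Jaccard similarity for sets. The first call uses the earlier worked example; the second uses assumed sets A = {1, 2, 3, 4, 5} and B = {3, 4, 5, 6, 7}, chosen for illustration because they reproduce the stated intersection {3, 4, 5} and union {1, ..., 7}:

```python
def jaccard(a, b):
    # Intersection size over union size
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3}, {2, 3, 4}))              # 0.5
print(jaccard({1, 2, 3, 4, 5}, {3, 4, 5, 6, 7}))  # 3/7 ≈ 0.4286
```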
### Discretization
Question 1: Equal-Width Discretization
Suppose you have a dataset of ages (in years) for a group of people: [22, 25, 30, 35, 40, 45, 50, 55, 60, 65]. Perform equal-width discretization into three bins. What are the boundaries of these bins?
Answer 1: To perform equal-width discretization, divide the range of ages into three equal-width intervals. The range is 65 - 22 = 43, so each bin has width 43 / 3 ≈ 14.33:
1. [22, 36.33)
2. [36.33, 50.67)
3. [50.67, 65]
Rounded to whole ages, the boundaries of the bins are [22, 35], [36, 50], and [51, 65].
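A short NumPy sketch of this computation: `linspace` produces the equal-width edges and `digitize` assigns each age to a bin:

```python
import numpy as np

ages = [22, 25, 30, 35, 40, 45, 50, 55, 60, 65]
edges = np.linspace(min(ages), max(ages), num=4)  # 3 bins need 4 edges
print(edges)                                      # [22. 36.33 50.67 65.]

# Interior edges split the data; each age gets bin index 0, 1, or 2
bins = np.digitize(ages, edges[1:-1])
print(bins)  # [0 0 0 0 1 1 1 2 2 2]
```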
Question 2: Equal-Frequency Discretization
Given a dataset of exam scores for a class: [60, 65, 70, 75, 80, 85, 90, 95], perform equal-frequency discretization into three bins. What are the boundaries of these bins?
Answer 2: Equal-frequency discretization divides the sorted data into bins that each contain approximately the same number of data points. With 8 data points and 3 bins, two bins get 3 points and one gets 2:
1. [60, 70] (3 points)
2. [75, 85] (3 points)
3. [90, 95] (2 points)
The boundaries of the bins are [60, 70], [75, 85], and [90, 95], as the sketch below confirms.
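A minimal sketch using `np.array_split`, which splits the sorted scores into near-equal groups (sizes 3, 3, 2 here):

```python
import numpy as np

scores = sorted([60, 65, 70, 75, 80, 85, 90, 95])
groups = np.array_split(scores, 3)  # near-equal group sizes: 3, 3, 2
for g in groups:
    print(g.min(), "-", g.max())
# 60 - 70
# 75 - 85
# 90 - 95
```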
Question 3: Entropy-Based Discretization
Suppose you have a dataset of temperatures (in degrees Celsius) for a month: [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]. Use entropy-based discretization to split the data into two bins. What are the boundaries of these bins?
Answer 3: Entropy-based discretization is a supervised method: it evaluates each candidate split point (typically the midpoint between consecutive sorted values) and keeps the one that minimizes the weighted entropy of the class labels in the two resulting bins. Since no class labels are given in this question, a unique split point cannot be determined from the temperatures alone; with labels, the algorithm would test each candidate and choose the lowest-entropy split.
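A sketch of that search. The temperatures come from the question, but the class labels below are hypothetical, added only so the supervised method has something to run on:

```python
import math

temps  = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
labels = ['cold'] * 5 + ['warm'] * 5  # assumed labels for illustration

def entropy(ys):
    # Shannon entropy of the label distribution in one bin
    n = len(ys)
    counts = {}
    for y in ys:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_split(xs, ys):
    # Try every midpoint between consecutive values; keep the split
    # with the lowest weighted entropy of the two resulting bins
    best = None
    for i in range(1, len(xs)):
        cut = (xs[i - 1] + xs[i]) / 2
        left, right = ys[:i], ys[i:]
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if best is None or w < best[1]:
            best = (cut, w)
    return best

print(best_split(temps, labels))  # (19.0, 0.0) under the assumed labels
```

Under these assumed labels the best boundary is 19 °C, which separates the two classes perfectly (weighted entropy 0); different labels would give a different split.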
#### Binning
1. Question: You have a dataset of 1000 exam scores ranging from 0 to 100. How would you create bins for this data using equal width binning with 5 bins?
Answer: To create equal width bins, calculate the bin width as (Max Value - Min Value) / Number of Bins. In this case, it's (100 - 0) / 5 = 20, giving the bins below (see the pandas sketch after the list):
Bin 1: 0-19
Bin 2: 20-39
Bin 3: 40-59
Bin 4: 60-79
Bin 5: 80-100
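A sketch with pandas; the 1000 scores are synthetic (randomly generated) since the original dataset isn't given:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
scores = pd.Series(rng.integers(0, 101, size=1000))  # synthetic 0-100 scores

# pd.cut with an integer bin count makes 5 equal-width intervals
binned = pd.cut(scores, bins=5)
print(binned.value_counts().sort_index())
```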
2. Question: Given a dataset of ages for a population, how would you create bins using
equal frequency binning with 4 bins?
Answer: Equal frequency binning divides the data into bins such that each bin contains approximately the same number of data points. Here's how you can do it (a pandas sketch follows the steps):
Sort the ages in ascending order.
Determine the quartiles (Q1, Q2, Q3).
Divide the data into 4 bins using these quartiles:
Bin 1: Ages less than or equal to Q1
Bin 2: Ages between Q1 and Q2
Bin 3: Ages between Q2 and Q3
Bin 4: Ages greater than or equal to Q3
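A quartile-binning sketch with `pandas.qcut`; the ages are made up for illustration:

```python
import pandas as pd

ages = pd.Series([18, 21, 24, 25, 29, 33, 38, 41, 47, 52, 58, 63])

# qcut splits on quantiles, so each bin gets ~the same count (3 here)
quartile_bins = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(quartile_bins.value_counts())
```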
3. Question: You have a dataset of the weights of 200 apples in grams. How would you create bins using custom binning with specific weight ranges (e.g., 100-150g, 151-200g, 201-250g)?
Answer: Custom binning allows you to define specific bin ranges (a pandas sketch follows the list). In this case, you can create bins as follows:
Bin 1: 100-150g
Bin 2: 151-200g
Bin 3: 201-250g
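A sketch with explicit edges matching these ranges; the sample weights are made up:

```python
import pandas as pd

weights = pd.Series([112, 148, 151, 167, 199, 203, 230, 249])  # sample weights (g)
edges = [100, 150, 200, 250]
labels = ["100-150g", "151-200g", "201-250g"]

# include_lowest keeps 100 g in the first bin; intervals close on the right
print(pd.cut(weights, bins=edges, labels=labels, include_lowest=True))
```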
4. Question: You have a dataset of 500 sales transactions, and you want to create bins for
transaction amounts using the natural breaks (Jenks) method. How many bins should
you create?
Answer: The natural breaks (Jenks) method chooses breakpoints that minimize within-bin variance while maximizing between-bin variance. There is no fixed number of bins; the optimal count depends on the data distribution, so you would typically use software or a library implementing Jenks Natural Breaks Classification and compare the goodness-of-fit across several candidate bin counts. The brute-force sketch below illustrates the objective.
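Not the Jenks algorithm itself, but a brute-force illustration of its objective on a tiny made-up dataset with the bin count fixed at 3: try every way to split the sorted values into contiguous groups and keep the split with the smallest total within-bin variance.

```python
from itertools import combinations
import statistics

amounts = sorted([12, 14, 15, 41, 43, 44, 90, 92, 95])  # made-up amounts

def within_variance(groups):
    # Total within-bin (population) variance -- the quantity Jenks minimizes
    return sum(statistics.pvariance(g) for g in groups)

def natural_breaks(data, k):
    best = None
    for cuts in combinations(range(1, len(data)), k - 1):
        idx = (0, *cuts, len(data))
        groups = [data[idx[i]:idx[i + 1]] for i in range(k)]
        score = within_variance(groups)
        if best is None or score < best[0]:
            best = (score, groups)
    return best[1]

print(natural_breaks(amounts, 3))
# [[12, 14, 15], [41, 43, 44], [90, 92, 95]] -- the natural clusters
```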
5. Question: You have a dataset of 50 temperatures recorded in degrees Celsius. How
would you create bins using quantile binning with 3 bins?
Answer: Quantile binning divides the data into bins based on specified quantiles. To create 3 bins, you can use the following approach (see the sketch after this list):
Calculate the 33.3rd percentile (first tertile, T1) and the 66.7th percentile (second tertile, T2).
Divide the data into 3 bins as follows:
Bin 1: Temperatures less than or equal to T1
Bin 2: Temperatures between T1 and T2
Bin 3: Temperatures greater than T2
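A tertile-binning sketch with NumPy percentiles; the readings are made up:

```python
import numpy as np

temps = np.array([3, 5, 8, 11, 14, 16, 19, 22, 25, 28, 30, 33])  # made-up readings

# Tertile cut points at the 33.3rd and 66.7th percentiles
t1, t2 = np.percentile(temps, [33.3, 66.7])

bin1 = temps[temps <= t1]
bin2 = temps[(temps > t1) & (temps <= t2)]
bin3 = temps[temps > t2]
print(t1, t2)
print(bin1, bin2, bin3)  # 4 readings per bin
```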
## Covariance Matrix
1. Question: Given two variables, X and Y, with the following data points:
X: [3, 4, 5, 6, 7] Y: [2, 3, 4, 5, 6]
Calculate the covariance between X and Y.
Answer: To calculate the covariance between two variables X and Y, you can use the
formula:
Cov(X, Y) = Σ[(Xᵢ - mean(X)) * (Yᵢ - mean(Y))] / (n - 1)
First, calculate the means:
Mean(X) = (3 + 4 + 5 + 6 + 7) / 5 = 5
Mean(Y) = (2 + 3 + 4 + 5 + 6) / 5 = 4
Then, calculate the covariance:
Cov(X, Y) = [(3 - 5) * (2 - 4) + (4 - 5) * (3 - 4) + (5 - 5) * (4 - 4) + (6 - 5) * (5 - 4) + (7 - 5) * (6 - 4)] / (5 - 1)
Cov(X, Y) = (4 + 1 + 0 + 1 + 4) / 4
Cov(X, Y) = 2.5
So, the covariance between X and Y is 2.5 (note that every product of deviations is non-negative here, since X and Y rise together).
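A quick NumPy check of the hand calculation; `np.cov` uses the same n - 1 denominator by default:

```python
import numpy as np

X = [3, 4, 5, 6, 7]
Y = [2, 3, 4, 5, 6]

# Off-diagonal entry of the 2x2 covariance matrix is Cov(X, Y)
print(np.cov(X, Y)[0, 1])  # 2.5
```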
2. Question: Given the following data representing three variables X, Y, and Z:
X: [2, 3, 4, 5, 6] Y: [1, 2, 3, 4, 5] Z: [4, 3, 2, 1, 0]
Calculate the covariance matrix for these variables.
Answer: To calculate the covariance matrix for three variables, you need to compute the
covariance between each pair of variables. The covariance matrix is a symmetric matrix
where each element represents the covariance between two variables. Here's how to calculate
it:
The diagonal entries are the variances: Cov(X, X) = Var(X), Cov(Y, Y) = Var(Y), Cov(Z, Z) = Var(Z).
First, calculate the means: Mean(X) = 4, Mean(Y) = 3, Mean(Z) = (4 + 3 + 2 + 1 + 0) / 5 = 2.
Then calculate the variances for each variable:
Var(X) = [(2-4)² + (3-4)² + (4-4)² + (5-4)² + (6-4)²] / 4 = 2.5
Var(Y) = [(1-3)² + (2-3)² + (3-3)² + (4-3)² + (5-3)²] / 4 = 2.5
Var(Z) = [(4-2)² + (3-2)² + (2-2)² + (1-2)² + (0-2)²] / 4 = 2.5
Now, calculate the covariances:
Cov(X, Y) = [(2-4)(1-3) + (3-4)(2-3) + (4-4)(3-3) + (5-4)(4-3) + (6-4)(5-3)] / 4 = (4 + 1 + 0 + 1 + 4) / 4 = 2.5
Cov(X, Z) = [(2-4)(4-2) + (3-4)(3-2) + (4-4)(2-2) + (5-4)(1-2) + (6-4)(0-2)] / 4 = (-4 - 1 + 0 - 1 - 4) / 4 = -2.5
Cov(Y, Z) = [(1-3)(4-2) + (2-3)(3-2) + (3-3)(2-2) + (4-3)(1-2) + (5-3)(0-2)] / 4 = (-4 - 1 + 0 - 1 - 4) / 4 = -2.5
The covariance matrix is:
| Var(X)    Cov(X, Y) Cov(X, Z) |
| Cov(Y, X) Var(Y)    Cov(Y, Z) |
| Cov(Z, X) Cov(Z, Y) Var(Z)    |
Substituting the values:
|  2.5   2.5  -2.5 |
|  2.5   2.5  -2.5 |
| -2.5  -2.5   2.5 |
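The same 3x3 matrix computed with NumPy, which treats each row of the input array as one variable:

```python
import numpy as np

data = np.array([[2, 3, 4, 5, 6],   # X
                 [1, 2, 3, 4, 5],   # Y
                 [4, 3, 2, 1, 0]])  # Z

print(np.cov(data))
# [[ 2.5  2.5 -2.5]
#  [ 2.5  2.5 -2.5]
#  [-2.5 -2.5  2.5]]
```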