# Data Mining Formula

In statistics, proximity measures are used to quantify the degree of similarity or dissimilarity

between data points or observations. These measures are often used in clustering, classification, and
dimensionality reduction tasks. Some common proximity measures include:

1. Euclidean Distance:

 Formula: √((x2 - x1)^2 + (y2 - y1)^2)

 Example: Calculate the Euclidean distance between points A(2, 3) and B(5, 7).

 Calculation: √((5 - 2)^2 + (7 - 3)^2) = √(3^2 + 4^2) = √(9 + 16) = √25 = 5

2. Manhattan Distance (L1 Distance):

 Formula: |x2 - x1| + |y2 - y1|

 Example: Calculate the Manhattan distance between points A(2, 3) and B(5, 7).

 Calculation: |5 - 2| + |7 - 3| = |3| + |4| = 3 + 4 = 7

3. Minkowski Distance (Generalized Distance):

 Formula: (∑(|xi - yi|^p))^(1/p)

 Example: Calculate the Minkowski distance with p=3 between points A(2, 3) and B(5,
7).

 Calculation: ((|2 - 5|^3) + (|3 - 7|^3))^(1/3) = (|-3|^3 + |-4|^3)^(1/3) = (27 + 64)^(1/3) = 91^(1/3) ≈ 4.5

4. Cosine Similarity:

 Formula: (A · B) / (||A|| * ||B||)

 Example: Calculate the cosine similarity between vectors A(2, 3) and B(5, 7).

 Calculation: (2×5 + 3×7) / (√(2^2 + 3^2) * √(5^2 + 7^2)) = (10 + 21) / (√(4 + 9) * √(25 + 49)) = 31 / (√13 * √74) ≈ 0.9995

5. Jaccard Similarity (for binary data):

 Formula: |A ∩ B| / |A ∪ B| (number of common elements divided by the number of elements in the union)

 Example: Calculate the Jaccard similarity between sets A = {1, 2, 3} and B = {2, 3, 4}.

 Calculation: (2 common elements {2, 3}) / (4 elements in the union {1, 2, 3, 4}) = 2/4 = 0.5
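The sketch below reproduces the five worked calculations with plain Python (standard library only; the helper names are illustrative, not from any particular package):

```python
# Minimal helpers for the five proximity measures above; points are
# tuples of coordinates and sets are Python sets.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def jaccard(s, t):
    return len(s & t) / len(s | t)

print(euclidean((2, 3), (5, 7)))          # 5.0
print(manhattan((2, 3), (5, 7)))          # 7
print(minkowski((2, 3), (5, 7), 3))       # ≈ 4.4979
print(cosine_similarity((2, 3), (5, 7)))  # ≈ 0.9995
print(jaccard({1, 2, 3}, {2, 3, 4}))      # 0.5
```

Because the helpers iterate over coordinates with zip, they also work for the three-dimensional points in the practice questions below.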

These are some common proximity measures used in statistics, but there are many others
depending on the specific data and context of the analysis. These measures help quantify the
relationships between data points, which is essential in various statistical and machine learning
applications.

Questions for practice

1. Euclidean Distance: Calculate the Euclidean distance between the following pairs of points:

 Point A (3, 4) and Point B (7, 9)


 Point X (1, 2, 3) and Point Y (4, 5, 6)

2. Manhattan Distance: Compute the Manhattan distance for the following pairs of points:

 Point P (2, 3) and Point Q (6, 8)

 Point M (1, 2, 3) and Point N (4, 5, 6)

3. Minkowski Distance: Calculate the Minkowski distance with p=2 (Euclidean distance) for the
following pairs of points:

 Point A (3, 4) and Point B (7, 9)

 Point X (1, 2, 3) and Point Y (4, 5, 6)

4. Cosine Similarity: Find the cosine similarity between the following pairs of vectors:

 Vector U (2, 4, 1) and Vector V (1, 3, 5)

 Vector W (0, 1, 2, 3) and Vector Z (1, 2, 3, 4)

5. Jaccard Similarity: Calculate the Jaccard similarity for the following pairs of sets (consider
binary data):

 Set S1 = {1, 2, 3} and Set S2 = {2, 3, 4}

 Set S3 = {A, B, C} and Set S4 = {B, C, D, E}

Question 1: Euclidean Distance Calculate the Euclidean distance between Point A (2, 3) and Point B
(5, 7).

Answer 1: Euclidean Distance = √((x2 - x1)^2 + (y2 - y1)^2) Euclidean Distance = √((5 - 2)^2 + (7 -
3)^2) Euclidean Distance = √(3^2 + 4^2) Euclidean Distance = √(9 + 16) Euclidean Distance = √25
Euclidean Distance = 5

The Euclidean distance between Point A and Point B is 5 units.

Question 2: Manhattan Distance Compute the Manhattan distance between Point X (3, 2) and Point
Y (6, 5).

Answer 2: Manhattan Distance = |x2 - x1| + |y2 - y1| Manhattan Distance = |6 - 3| + |5 - 2|


Manhattan Distance = |3| + |3| Manhattan Distance = 3 + 3 Manhattan Distance = 6

The Manhattan distance between Point X and Point Y is 6 units.

Question 3: Cosine Similarity Find the cosine similarity between Vector U (1, 2) and Vector V (3, 4).

Answer 3: Cosine Similarity = (U · V) / (||U|| * ||V||) Cosine Similarity = (1×3 + 2×4) / (√(1^2 + 2^2) * √(3^2 + 4^2)) Cosine Similarity = (3 + 8) / (√(1 + 4) * √(9 + 16)) Cosine Similarity = 11 / (√5 * √25) Cosine Similarity = 11 / (√5 * 5) Cosine Similarity = 11 / (5√5) ≈ 0.9839

The cosine similarity between Vector U and Vector V is 11 / (5√5) ≈ 0.98.

These numerical questions and answers demonstrate calculations using proximity measures in
statistics.

Here are some more challenging numerical questions on proximity measures in statistics, along with their answers:
Question 1: Minkowski Distance Calculate the Minkowski distance with p=3 between Point P (1, 2,
3) and Point Q (4, 5, 6).

Answer 1: Minkowski Distance = (∑(|xi - yi|^p))^(1/p) Minkowski Distance = ((|1 - 4|^3) + (|2 - 5|^3) + (|3 - 6|^3))^(1/3) Minkowski Distance = (3^3 + 3^3 + 3^3)^(1/3) Minkowski Distance = (27 + 27 + 27)^(1/3) Minkowski Distance = 81^(1/3) ≈ 4.3267

Question 2: Cosine Similarity in High Dimensions Calculate the cosine similarity between two 1000-dimensional vectors U and V, where U has a 1 in the first position and zeros elsewhere, and V has a 1 in the second position and zeros elsewhere.

Answer 2: Cosine Similarity = (U · V) / (||U|| * ||V||) Cosine Similarity = (1×0 + 0×1 + 0×0 + ... + 0×0) / (√(1^2 + 0^2 + ... + 0^2) * √(0^2 + 1^2 + ... + 0^2)) Cosine Similarity = 0 / (√1 * √1) Cosine Similarity = 0

In high dimensions, the cosine similarity between orthogonal vectors is 0.
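A short NumPy sketch of this example; the one-hot construction mirrors the vectors described above:

```python
# Sketch: cosine similarity of two orthogonal one-hot vectors is 0
# regardless of dimensionality.
import numpy as np

u = np.zeros(1000)
v = np.zeros(1000)
u[0] = 1.0  # 1 in the first position
v[1] = 1.0  # 1 in the second position

cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos)  # 0.0
```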

Question 3: Jaccard Similarity with Sets Calculate the Jaccard similarity for the following pairs of
sets (consider binary data):

 Set A = {1, 2, 3, 4, 5} and Set B = {3, 4, 5, 6, 7}

 Set X = {a, b, c, d, e} and Set Y = {c, d, e, f, g}

Answer 3: Jaccard Similarity = |intersection| / |union| (number of common elements divided by the number of elements in the union)

 For Set A and Set B: Jaccard Similarity = (3 common elements {3, 4, 5}) / (7 total distinct
elements {1, 2, 3, 4, 5, 6, 7}) = 3/7

 For Set X and Set Y: Jaccard Similarity = (3 common elements {c, d, e}) / (7 total distinct
elements {a, b, c, d, e, f, g}) = 3/7

The Jaccard similarity for both pairs of sets is 3/7.

### Discretization
Question 1: Equal Width Discretization Suppose you have a dataset of ages (in
years) for a group of people: [22, 25, 30, 35, 40, 45, 50, 55, 60, 65]. Perform
equal-width discretization into three bins. What are the boundaries of these
bins?
Answer 1: To perform equal-width discretization, compute the bin width as (max - min) / k = (65 - 22) / 3 ≈ 14.33, giving edges at 22, 36.33, 50.67, and 65. Rounded to whole years, the three bins are:
1. 22 - 35
2. 36 - 50
3. 51 - 65
So, the boundaries of the bins are approximately [22, 35], [36, 50], and [51, 65].
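A minimal sketch of this computation, using the width = (max - min) / k formula directly:

```python
# Sketch of equal-width discretization for the ages example.
import numpy as np

ages = [22, 25, 30, 35, 40, 45, 50, 55, 60, 65]
k = 3
width = (max(ages) - min(ages)) / k                  # (65 - 22) / 3 ≈ 14.33
edges = [min(ages) + i * width for i in range(k + 1)]
print(edges)                           # [22.0, 36.33..., 50.66..., 65.0]
print(np.digitize(ages, edges[1:-1]))  # bin index (0, 1, or 2) for each age
```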
Question 2: Equal Frequency Discretization Given a dataset of exam scores for
a class: [60, 65, 70, 75, 80, 85, 90, 95], perform equal-frequency discretization
into three bins. What are the boundaries of these bins?
Answer 2: To perform equal-frequency discretization, divide the sorted data into three bins so that each bin contains roughly the same number of data points. With 8 data points, each bin should hold about 8 / 3 ≈ 2.67 points, so two bins get 3 points and one gets 2:
1. [60, 70]
2. [75, 85]
3. [90, 95]
The boundaries of the bins are [60, 70], [75, 85], and [90, 95].
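One way to reproduce this, assuming pandas is acceptable, is pd.qcut, which cuts at quantiles so each bin holds roughly the same number of points:

```python
# Sketch of equal-frequency discretization with pandas.qcut.
import pandas as pd

scores = [60, 65, 70, 75, 80, 85, 90, 95]
bins = pd.qcut(scores, q=3)
print(pd.Series(bins).value_counts())  # roughly 3 / 3 / 2 points per bin
```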
Question 3: Entropy-Based Discretization Suppose you have a dataset of
temperatures (in degrees Celsius) for a month: [10, 12, 14, 16, 18, 20, 22, 24,
26, 28]. Use entropy-based discretization to split the data into two bins. What
are the boundaries of these bins?
Answer 3: Entropy-based discretization aims to minimize the entropy of the
resulting bins. The algorithm will try to find the best split point that minimizes
entropy.

Entropy for a set S with two classes is calculated as:

Entropy(S) = -p1 * log2(p1) - p2 * log2(p2)

where p1 is the proportion of data points in one class and p2 is the proportion in the other. First compute the entropy of the entire dataset [10, 12, 14, 16, 18, 20, 22, 24, 26, 28], then compute the weighted entropy of the two bins produced by each candidate split point and keep the split that minimizes it. Here the split point that minimizes entropy is 20, so the boundaries of the bins are [10, 20] and [22, 28].
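A sketch of the entropy formula above. Entropy-based splits require class labels, which the temperature example does not provide, so the labels below are purely hypothetical:

```python
# Two-class entropy as defined above; the cold/warm labels are
# hypothetical stand-ins, not part of the original dataset.
import math

def entropy(labels):
    n = len(labels)
    proportions = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in proportions)

labels = ["cold"] * 6 + ["warm"] * 4  # hypothetical labels for the 10 readings
print(entropy(labels))                # ≈ 0.971 bits
```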
These questions demonstrate different discretization techniques and how to
calculate bin boundaries based on various criteria.

#### Binning

1. Question: You have a dataset of 1000 exam scores ranging from 0 to 100. How would you create bins for this data using equal width binning with 5 bins?
Answer: To create equal width bins, you can calculate the bin width as (Max Value - Min
Value) / Number of Bins. In this case, it's (100 - 0) / 5 = 20. So, the bins would be:
 Bin 1: 0-19
 Bin 2: 20-39
 Bin 3: 40-59
 Bin 4: 60-79
 Bin 5: 80-100
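A minimal sketch of this scheme; the scores array is a random placeholder standing in for the real 1000 values:

```python
# Equal-width binning with width (100 - 0) / 5 = 20.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.integers(0, 101, size=1000)  # placeholder exam scores

bin_index = np.minimum(scores // 20, 4)   # clamp 100 into the last bin (80-100)
print(np.bincount(bin_index))             # counts for bins 0-19, ..., 80-100
```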
2. Question: Given a dataset of ages for a population, how would you create bins using
equal frequency binning with 4 bins?
Answer: Equal frequency binning divides the data into bins such that each bin contains
approximately the same number of data points. Here's how you can do it:
 Sort the ages in ascending order.
 Determine the quartiles (Q1, Q2, Q3).
 Divide the data into 4 bins using these quartiles:
 Bin 1: Ages less than or equal to Q1
 Bin 2: Ages between Q1 and Q2
 Bin 3: Ages between Q2 and Q3
 Bin 4: Ages greater than or equal to Q3
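An illustrative sketch with pandas.qcut, which cuts at the quartiles automatically (the ages are hypothetical):

```python
# Four equal-frequency bins cut at the quartiles Q1, Q2, Q3.
import pandas as pd

ages = pd.Series([18, 22, 25, 29, 31, 34, 38, 41, 45, 52, 58, 63])  # hypothetical
bins = pd.qcut(ages, q=4, labels=["<=Q1", "Q1-Q2", "Q2-Q3", ">=Q3"])
print(bins.value_counts())  # 3 ages per bin
```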
3. Question: You have a dataset of the weights of 200 apples in grams. How would you
create bins using custom binning with specific weight ranges (e.g., 100-150g, 151-
200g, 201-250g)?
Answer: Custom binning allows you to define specific bin ranges. In this case, you can create
bins as follows:
 Bin 1: 100-150g
 Bin 2: 151-200g
 Bin 3: 201-250g
4. Question: You have a dataset of 500 sales transactions, and you want to create bins for
transaction amounts using the natural breaks (Jenks) method. How many bins should
you create?
Answer: To determine the number of bins using the natural breaks (Jenks) method, you
would need to use a statistical algorithm to find the optimal number of bins that minimize
within-bin variance. There isn't a fixed number of bins, as it depends on the data distribution.
You would typically use software or libraries (e.g., Jenks Natural Breaks Classification) to
calculate the optimal number of bins.
5. Question: You have a dataset of 50 temperatures recorded in degrees Celsius. How
would you create bins using quantile binning with 3 bins?
Answer: Quantile binning divides the data into bins based on specified quantiles. To create 3
bins, you can use the following approach:
 Calculate the 33rd percentile (Q1), and 66th percentile (Q2).
 Divide the data into 3 bins as follows:
 Bin 1: Temperatures less than or equal to Q1
 Bin 2: Temperatures between Q1 and Q2
 Bin 3: Temperatures greater than Q2
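A minimal NumPy sketch of this tertile scheme (the temperature readings are made-up):

```python
# Quantile binning at the 33rd and 66th percentiles.
import numpy as np

temps = np.array([12.1, 14.0, 15.3, 17.8, 18.2, 19.5, 21.0, 22.4, 24.9])
q1, q2 = np.percentile(temps, [33, 66])
bin_index = np.digitize(temps, [q1, q2], right=True)  # 0: <=Q1, 1: (Q1, Q2], 2: >Q2
print(q1, q2, bin_index)
```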
## Covariance Matrix
1. Question: Given two variables, X and Y, with the following data points:
X: [3, 4, 5, 6, 7] Y: [2, 3, 4, 5, 6]
Calculate the covariance between X and Y.
Answer: To calculate the covariance between two variables X and Y, you can use the
formula:
Cov(X, Y) = Σ[(Xᵢ - mean(X)) * (Yᵢ - mean(Y))] / (n - 1)
First, calculate the means:
 Mean(X) = (3 + 4 + 5 + 6 + 7) / 5 = 5
 Mean(Y) = (2 + 3 + 4 + 5 + 6) / 5 = 4
Then, calculate the covariance:
 Cov(X, Y) = [(3 - 5) * (2 - 4) + (4 - 5) * (3 - 4) + (5 - 5) * (4 - 4) + (6 - 5) * (5 - 4) + (7 - 5) * (6 - 4)] / (5 - 1)
 Cov(X, Y) = (4 + 1 + 0 + 1 + 4) / 4
 Cov(X, Y) = 2.5
So, the covariance between X and Y is 2.5.
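This result can be double-checked with NumPy (ddof=1 matches the n - 1 denominator in the formula):

```python
# Quick check of the covariance computed above.
import numpy as np

X = [3, 4, 5, 6, 7]
Y = [2, 3, 4, 5, 6]
print(np.cov(X, Y, ddof=1)[0, 1])  # 2.5
```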
2. Question: Given the following data representing three variables X, Y, and Z:
X: [2, 3, 4, 5, 6] Y: [1, 2, 3, 4, 5] Z: [4, 3, 2, 1, 0]
Calculate the covariance matrix for these variables.
Answer: To calculate the covariance matrix for three variables, you need to compute the
covariance between each pair of variables. The covariance matrix is a symmetric matrix
where each element represents the covariance between two variables. Here's how to calculate
it:
 Cov(X, X) = Var(X), Cov(Y, Y) = Var(Y), Cov(Z, Z) = Var(Z)
First, calculate the means: Mean(X) = 4, Mean(Y) = 3, Mean(Z) = 2. Then the variances:
 Var(X) = [(2-4)² + (3-4)² + (4-4)² + (5-4)² + (6-4)²] / 4 = 10 / 4 = 2.5
 Var(Y) = [(1-3)² + (2-3)² + (3-3)² + (4-3)² + (5-3)²] / 4 = 10 / 4 = 2.5
 Var(Z) = [(4-2)² + (3-2)² + (2-2)² + (1-2)² + (0-2)²] / 4 = 10 / 4 = 2.5
Now, calculate the covariances:
 Cov(X, Y) = [(2-4)(1-3) + (3-4)(2-3) + (4-4)(3-3) + (5-4)(4-3) + (6-4)(5-3)] / 4 = (4 + 1 + 0 + 1 + 4) / 4 = 2.5
 Cov(X, Z) = [(2-4)(4-2) + (3-4)(3-2) + (4-4)(2-2) + (5-4)(1-2) + (6-4)(0-2)] / 4 = (-4 - 1 + 0 - 1 - 4) / 4 = -2.5
 Cov(Y, Z) = [(1-3)(4-2) + (2-3)(3-2) + (3-3)(2-2) + (4-3)(1-2) + (5-3)(0-2)] / 4 = (-4 - 1 + 0 - 1 - 4) / 4 = -2.5
The covariance matrix is:

| Var(X)    Cov(X, Y)  Cov(X, Z) |   |  2.5   2.5  -2.5 |
| Cov(Y, X) Var(Y)     Cov(Y, Z) | = |  2.5   2.5  -2.5 |
| Cov(Z, X) Cov(Z, Y)  Var(Z)    |   | -2.5  -2.5   2.5 |
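The whole matrix can likewise be verified with np.cov, where each row of the input is one variable:

```python
# Full 3x3 covariance matrix for X, Y, Z.
import numpy as np

data = np.array([[2, 3, 4, 5, 6],    # X
                 [1, 2, 3, 4, 5],    # Y
                 [4, 3, 2, 1, 0]])   # Z
print(np.cov(data, ddof=1))
# [[ 2.5  2.5 -2.5]
#  [ 2.5  2.5 -2.5]
#  [-2.5 -2.5  2.5]]
```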


3. Question: If the covariance between two variables X and Y is -0.8, what does it imply
about their relationship?
Answer: A covariance of -0.8 indicates a negative linear relationship between variables X and Y: as the values of X increase, the values of Y tend to decrease, and vice versa. Because covariance is not normalized, its magnitude depends on the units of the variables, so -0.8 alone does not indicate how strong the relationship is; the correlation coefficient would be needed for that.
I hope these numerical questions and answers help you understand covariance matrices
better!
