Data Mining Formula
Proximity measures quantify the similarity or dissimilarity between data points or observations. These measures are often used in clustering, classification, and dimensionality reduction tasks. Some common proximity measures include:
1. Euclidean Distance:
Example: Calculate the Euclidean distance between points A(2, 3) and B(5, 7).
2. Manhattan Distance:
Example: Calculate the Manhattan distance between points A(2, 3) and B(5, 7).
3. Minkowski Distance:
Example: Calculate the Minkowski distance with p=3 between points A(2, 3) and B(5, 7).
4. Cosine Similarity:
Example: Calculate the cosine similarity between vectors A(2, 3) and B(5, 7).
Calculation: (2*5 + 3*7) / (√(2^2 + 3^2) * √(5^2 + 7^2)) = (10 + 21) / (√13 * √74) = 31 / √962 ≈ 0.9995
5. Jaccard Similarity:
Example: Calculate the Jaccard similarity between sets A = {1, 2, 3} and B = {2, 3, 4}.
Calculation: (2 common elements {2, 3}) / (4 total distinct elements {1, 2, 3, 4}) = 2/4 = 0.5
These are some common proximity measures used in statistics, but there are many others
depending on the specific data and context of the analysis. These measures help quantify the
relationships between data points, which is essential in various statistical and machine learning
applications.
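A minimal Python sketch of the first four measures, using the points from the worked examples above; the function names are illustrative, not taken from any particular library:

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    # Generalizes Euclidean (p=2) and Manhattan (p=1)
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

A, B = (2, 3), (5, 7)
print(euclidean(A, B))          # 5.0
print(manhattan(A, B))          # 7
print(minkowski(A, B, 3))       # ~4.498
print(cosine_similarity(A, B))  # ~0.9995
```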
1. Euclidean Distance: Calculate the Euclidean distance between the following pairs of points:
2. Manhattan Distance: Compute the Manhattan distance for the following pairs of points:
3. Minkowski Distance: Calculate the Minkowski distance with p=2 (Euclidean distance) for the
following pairs of points:
4. Cosine Similarity: Find the cosine similarity between the following pairs of vectors:
5. Jaccard Similarity: Calculate the Jaccard similarity for the following pairs of sets (consider
binary data):
Question 1: Euclidean Distance
Calculate the Euclidean distance between Point A (2, 3) and Point B (5, 7).
Answer 1:
Euclidean Distance = √((x2 - x1)^2 + (y2 - y1)^2)
= √((5 - 2)^2 + (7 - 3)^2)
= √(3^2 + 4^2)
= √(9 + 16)
= √25
= 5
Question 2: Manhattan Distance
Compute the Manhattan distance between Point X (3, 2) and Point Y (6, 5).
Answer 2:
Manhattan Distance = |x2 - x1| + |y2 - y1|
= |6 - 3| + |5 - 2|
= 3 + 3
= 6
Question 3: Cosine Similarity
Find the cosine similarity between Vector U (1, 2) and Vector V (3, 4).
Answer 3:
Cosine Similarity = (U · V) / (||U|| * ||V||)
= (1*3 + 2*4) / (√(1^2 + 2^2) * √(3^2 + 4^2))
= (3 + 8) / (√5 * √25)
= 11 / (√5 * 5)
≈ 11 / 11.18
≈ 0.9839
These numerical questions and answers demonstrate calculations using proximity measures in
statistics.
Question 1: Minkowski Distance
Calculate the Minkowski distance with p=3 between points A(1, 2, 3) and B(4, 5, 6).
Answer 1:
Minkowski Distance = (∑|xi - yi|^p)^(1/p)
= (|1 - 4|^3 + |2 - 5|^3 + |3 - 6|^3)^(1/3)
= (3^3 + 3^3 + 3^3)^(1/3)
= (27 + 27 + 27)^(1/3)
= 81^(1/3)
≈ 4.3267
Question 2: Cosine Similarity in High Dimensions
Calculate the cosine similarity between two high-dimensional vectors, U = [1, 0, 0, 0, ...] and V = [0, 1, 0, 0, ...] (1000 dimensions each), where U has a single 1 in the first position and V has a single 1 in the second position, with zeros everywhere else.
Answer 2:
Cosine Similarity = (U · V) / (||U|| * ||V||)
= (1*0 + 0*1 + 0*0 + ... + 0*0) / (√(1^2 + 0^2 + ... + 0^2) * √(0^2 + 1^2 + ... + 0^2))
= 0 / (√1 * √1)
= 0
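A quick NumPy check of this result; the dimension and positions of the 1s follow the question above:

```python
import numpy as np

U = np.zeros(1000); U[0] = 1.0  # 1 in the first position
V = np.zeros(1000); V[1] = 1.0  # 1 in the second position

# Dot product over the product of norms
cos_sim = U @ V / (np.linalg.norm(U) * np.linalg.norm(V))
print(cos_sim)  # 0.0 -- the vectors are orthogonal
```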
Question 3: Jaccard Similarity with Sets
Calculate the Jaccard similarity for the following pairs of sets (consider binary data):
Answer 3: Jaccard Similarity = |A ∩ B| / |A ∪ B| (number of common elements over total number of distinct elements)
For Set A and Set B: Jaccard Similarity = (3 common elements {3, 4, 5}) / (7 total distinct elements {1, 2, 3, 4, 5, 6, 7}) = 3/7
For Set X and Set Y: Jaccard Similarity = (3 common elements {c, d, e}) / (7 total distinct elements {a, b, c, d, e, f, g}) = 3/7
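A one-function sketch of Jaccard similarity for sets. The first call uses the earlier worked example; the second uses assumed sets A = {1, 2, 3, 4, 5} and B = {3, 4, 5, 6, 7}, chosen for illustration because they reproduce the stated intersection {3, 4, 5} and union {1, ..., 7}:

```python
def jaccard(a, b):
    # Intersection size over union size
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3}, {2, 3, 4}))              # 0.5
print(jaccard({1, 2, 3, 4, 5}, {3, 4, 5, 6, 7}))  # 3/7 ≈ 0.4286
```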
### Discretization
Question 1: Equal-Width Discretization
Suppose you have a dataset of ages (in years) for a group of people: [22, 25, 30, 35, 40, 45, 50, 55, 60, 65]. Perform equal-width discretization into three bins. What are the boundaries of these bins?
Answer 1: To perform equal-width discretization, divide the range of ages into three equal-width intervals. The range is 65 - 22 = 43, so each bin has width 43 / 3 ≈ 14.33:
1. [22, 36.33)
2. [36.33, 50.67)
3. [50.67, 65]
Rounded to whole ages, the boundaries of the bins are [22, 35], [36, 50], and [51, 65].
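A short NumPy sketch of this computation: `linspace` produces the equal-width edges and `digitize` assigns each age to a bin:

```python
import numpy as np

ages = [22, 25, 30, 35, 40, 45, 50, 55, 60, 65]
edges = np.linspace(min(ages), max(ages), num=4)  # 3 bins need 4 edges
print(edges)                                      # [22. 36.33 50.67 65.]

# Interior edges split the data; each age gets bin index 0, 1, or 2
bins = np.digitize(ages, edges[1:-1])
print(bins)  # [0 0 0 0 1 1 1 2 2 2]
```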
Question 2: Equal-Frequency Discretization
Given a dataset of exam scores for a class: [60, 65, 70, 75, 80, 85, 90, 95], perform equal-frequency discretization into three bins. What are the boundaries of these bins?
Answer 2: Equal-frequency discretization divides the sorted data into bins that each contain approximately the same number of data points. With 8 data points and 3 bins, two bins get 3 points and one gets 2:
1. [60, 70] (3 points)
2. [75, 85] (3 points)
3. [90, 95] (2 points)
The boundaries of the bins are [60, 70], [75, 85], and [90, 95], as the sketch below confirms.
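A minimal sketch using `np.array_split`, which splits the sorted scores into near-equal groups (sizes 3, 3, 2 here):

```python
import numpy as np

scores = sorted([60, 65, 70, 75, 80, 85, 90, 95])
groups = np.array_split(scores, 3)  # near-equal group sizes: 3, 3, 2
for g in groups:
    print(g.min(), "-", g.max())
# 60 - 70
# 75 - 85
# 90 - 95
```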
Question 3: Entropy-Based Discretization
Suppose you have a dataset of temperatures (in degrees Celsius) for a month: [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]. Use entropy-based discretization to split the data into two bins. What are the boundaries of these bins?
Answer 3: Entropy-based discretization is a supervised method: it evaluates each candidate split point (typically the midpoint between consecutive sorted values) and keeps the one that minimizes the weighted entropy of the class labels in the two resulting bins. Since no class labels are given in this question, a unique split point cannot be determined from the temperatures alone; with labels, the algorithm would test each candidate and choose the lowest-entropy split.
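A sketch of that search. The temperatures come from the question, but the class labels below are hypothetical, added only so the supervised method has something to run on:

```python
import math

temps  = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
labels = ['cold'] * 5 + ['warm'] * 5  # assumed labels for illustration

def entropy(ys):
    # Shannon entropy of the label distribution in one bin
    n = len(ys)
    counts = {}
    for y in ys:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_split(xs, ys):
    # Try every midpoint between consecutive values; keep the split
    # with the lowest weighted entropy of the two resulting bins
    best = None
    for i in range(1, len(xs)):
        cut = (xs[i - 1] + xs[i]) / 2
        left, right = ys[:i], ys[i:]
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if best is None or w < best[1]:
            best = (cut, w)
    return best

print(best_split(temps, labels))  # (19.0, 0.0) under the assumed labels
```

Under these assumed labels the best boundary is 19 °C, which separates the two classes perfectly (weighted entropy 0); different labels would give a different split.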
#### Binning
1. Question: You have a dataset of 1000 exam scores ranging from 0 to 100. How would you create bins for this data using equal width binning with 5 bins?
Answer: To create equal width bins, calculate the bin width as (Max Value - Min Value) / Number of Bins. In this case, it's (100 - 0) / 5 = 20, giving the bins below (see the pandas sketch after the list):
Bin 1: 0-19
Bin 2: 20-39
Bin 3: 40-59
Bin 4: 60-79
Bin 5: 80-100
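A sketch with pandas; the 1000 scores are synthetic (randomly generated) since the original dataset isn't given:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
scores = pd.Series(rng.integers(0, 101, size=1000))  # synthetic 0-100 scores

# pd.cut with an integer bin count makes 5 equal-width intervals
binned = pd.cut(scores, bins=5)
print(binned.value_counts().sort_index())
```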
2. Question: Given a dataset of ages for a population, how would you create bins using
equal frequency binning with 4 bins?
Answer: Equal frequency binning divides the data into bins such that each bin contains approximately the same number of data points. Here's how you can do it (a pandas sketch follows the steps):
Sort the ages in ascending order.
Determine the quartiles (Q1, Q2, Q3).
Divide the data into 4 bins using these quartiles:
Bin 1: Ages less than or equal to Q1
Bin 2: Ages between Q1 and Q2
Bin 3: Ages between Q2 and Q3
Bin 4: Ages greater than or equal to Q3
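A quartile-binning sketch with `pandas.qcut`; the ages are made up for illustration:

```python
import pandas as pd

ages = pd.Series([18, 21, 24, 25, 29, 33, 38, 41, 47, 52, 58, 63])

# qcut splits on quantiles, so each bin gets ~the same count (3 here)
quartile_bins = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(quartile_bins.value_counts())
```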
3. Question: You have a dataset of the weights of 200 apples in grams. How would you create bins using custom binning with specific weight ranges (e.g., 100-150g, 151-200g, 201-250g)?
Answer: Custom binning allows you to define specific bin ranges (a pandas sketch follows the list). In this case, you can create bins as follows:
Bin 1: 100-150g
Bin 2: 151-200g
Bin 3: 201-250g
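A sketch with explicit edges matching these ranges; the sample weights are made up:

```python
import pandas as pd

weights = pd.Series([112, 148, 151, 167, 199, 203, 230, 249])  # sample weights (g)
edges = [100, 150, 200, 250]
labels = ["100-150g", "151-200g", "201-250g"]

# include_lowest keeps 100 g in the first bin; intervals close on the right
print(pd.cut(weights, bins=edges, labels=labels, include_lowest=True))
```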
4. Question: You have a dataset of 500 sales transactions, and you want to create bins for
transaction amounts using the natural breaks (Jenks) method. How many bins should
you create?
Answer: The natural breaks (Jenks) method chooses breakpoints that minimize within-bin variance while maximizing between-bin variance. There is no fixed number of bins; the optimal count depends on the data distribution, so you would typically use software or a library implementing Jenks Natural Breaks Classification and compare the goodness-of-fit across several candidate bin counts. The brute-force sketch below illustrates the objective.
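Not the Jenks algorithm itself, but a brute-force illustration of its objective on a tiny made-up dataset with the bin count fixed at 3: try every way to split the sorted values into contiguous groups and keep the split with the smallest total within-bin variance.

```python
from itertools import combinations
import statistics

amounts = sorted([12, 14, 15, 41, 43, 44, 90, 92, 95])  # made-up amounts

def within_variance(groups):
    # Total within-bin (population) variance -- the quantity Jenks minimizes
    return sum(statistics.pvariance(g) for g in groups)

def natural_breaks(data, k):
    best = None
    for cuts in combinations(range(1, len(data)), k - 1):
        idx = (0, *cuts, len(data))
        groups = [data[idx[i]:idx[i + 1]] for i in range(k)]
        score = within_variance(groups)
        if best is None or score < best[0]:
            best = (score, groups)
    return best[1]

print(natural_breaks(amounts, 3))
# [[12, 14, 15], [41, 43, 44], [90, 92, 95]] -- the natural clusters
```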
5. Question: You have a dataset of 50 temperatures recorded in degrees Celsius. How
would you create bins using quantile binning with 3 bins?
Answer: Quantile binning divides the data into bins based on specified quantiles. To create 3 bins, you can use the following approach (see the sketch after this list):
Calculate the 33.3rd percentile (first tertile, T1) and the 66.7th percentile (second tertile, T2).
Divide the data into 3 bins as follows:
Bin 1: Temperatures less than or equal to T1
Bin 2: Temperatures between T1 and T2
Bin 3: Temperatures greater than T2
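A tertile-binning sketch with NumPy percentiles; the readings are made up:

```python
import numpy as np

temps = np.array([3, 5, 8, 11, 14, 16, 19, 22, 25, 28, 30, 33])  # made-up readings

# Tertile cut points at the 33.3rd and 66.7th percentiles
t1, t2 = np.percentile(temps, [33.3, 66.7])

bin1 = temps[temps <= t1]
bin2 = temps[(temps > t1) & (temps <= t2)]
bin3 = temps[temps > t2]
print(t1, t2)
print(bin1, bin2, bin3)  # 4 readings per bin
```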
## Covariance Matrix
1. Question: Given two variables, X and Y, with the following data points:
X: [3, 4, 5, 6, 7] Y: [2, 3, 4, 5, 6]
Calculate the covariance between X and Y.
Answer: To calculate the covariance between two variables X and Y, you can use the
formula:
Cov(X, Y) = Σ[(Xᵢ - mean(X)) * (Yᵢ - mean(Y))] / (n - 1)
First, calculate the means:
Mean(X) = (3 + 4 + 5 + 6 + 7) / 5 = 5
Mean(Y) = (2 + 3 + 4 + 5 + 6) / 5 = 4
Then, calculate the covariance:
Cov(X, Y) = [(3 - 5) * (2 - 4) + (4 - 5) * (3 - 4) + (5 - 5) * (4 - 4) + (6 - 5) * (5 - 4) + (7 - 5) * (6 - 4)] / (5 - 1)
Cov(X, Y) = (4 + 1 + 0 + 1 + 4) / 4
Cov(X, Y) = 2.5
So, the covariance between X and Y is 2.5 (note that every product of deviations is non-negative here, since X and Y rise together).
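A quick NumPy check of the hand calculation; `np.cov` uses the same n - 1 denominator by default:

```python
import numpy as np

X = [3, 4, 5, 6, 7]
Y = [2, 3, 4, 5, 6]

# Off-diagonal entry of the 2x2 covariance matrix is Cov(X, Y)
print(np.cov(X, Y)[0, 1])  # 2.5
```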
2. Question: Given the following data representing three variables X, Y, and Z:
X: [2, 3, 4, 5, 6] Y: [1, 2, 3, 4, 5] Z: [4, 3, 2, 1, 0]
Calculate the covariance matrix for these variables.
Answer: To calculate the covariance matrix for three variables, you need to compute the
covariance between each pair of variables. The covariance matrix is a symmetric matrix
where each element represents the covariance between two variables. Here's how to calculate
it:
The diagonal entries are the variances: Cov(X, X) = Var(X), Cov(Y, Y) = Var(Y), Cov(Z, Z) = Var(Z).
First, calculate the means: Mean(X) = 4, Mean(Y) = 3, Mean(Z) = (4 + 3 + 2 + 1 + 0) / 5 = 2.
Then calculate the variances for each variable:
Var(X) = [(2-4)² + (3-4)² + (4-4)² + (5-4)² + (6-4)²] / 4 = 2.5
Var(Y) = [(1-3)² + (2-3)² + (3-3)² + (4-3)² + (5-3)²] / 4 = 2.5
Var(Z) = [(4-2)² + (3-2)² + (2-2)² + (1-2)² + (0-2)²] / 4 = 2.5
Now, calculate the covariances:
Cov(X, Y) = [(2-4)(1-3) + (3-4)(2-3) + (4-4)(3-3) + (5-4)(4-3) + (6-4)(5-3)] / 4 = (4 + 1 + 0 + 1 + 4) / 4 = 2.5
Cov(X, Z) = [(2-4)(4-2) + (3-4)(3-2) + (4-4)(2-2) + (5-4)(1-2) + (6-4)(0-2)] / 4 = (-4 - 1 + 0 - 1 - 4) / 4 = -2.5
Cov(Y, Z) = [(1-3)(4-2) + (2-3)(3-2) + (3-3)(2-2) + (4-3)(1-2) + (5-3)(0-2)] / 4 = (-4 - 1 + 0 - 1 - 4) / 4 = -2.5
The covariance matrix is:
| Var(X)    Cov(X, Y) Cov(X, Z) |
| Cov(Y, X) Var(Y)    Cov(Y, Z) |
| Cov(Z, X) Cov(Z, Y) Var(Z)    |
Substituting the values:
|  2.5   2.5  -2.5 |
|  2.5   2.5  -2.5 |
| -2.5  -2.5   2.5 |
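The same 3x3 matrix computed with NumPy, which treats each row of the input array as one variable:

```python
import numpy as np

data = np.array([[2, 3, 4, 5, 6],   # X
                 [1, 2, 3, 4, 5],   # Y
                 [4, 3, 2, 1, 0]])  # Z

print(np.cov(data))
# [[ 2.5  2.5 -2.5]
#  [ 2.5  2.5 -2.5]
#  [-2.5 -2.5  2.5]]
```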