
Formulas at a Glance: IDS

Arithmetic Mean
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
where n is the number of observations.

Weighted arithmetic mean
\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
where w_i is the weight of x_i.
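As a quick illustration, a minimal Python sketch of both means (plain lists assumed, no particular library):

    # Plain and weighted arithmetic mean (minimal sketch).
    def mean(xs):
        return sum(xs) / len(xs)

    def weighted_mean(xs, ws):
        # sum(w_i * x_i) / sum(w_i)
        return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

    print(mean([2, 4, 6]))                      # 4.0
    print(weighted_mean([2, 4, 6], [1, 1, 2]))  # (2 + 4 + 12) / 4 = 4.5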

Median
Middle value if odd number of values, or average of the middle two values otherwise.
Estimated by interpolation (for grouped data):
\text{median} = L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}
where L_1 is the lower boundary of the median interval, (\sum \text{freq})_l is the sum of the frequencies of the intervals below it, \text{freq}_{\text{median}} is the frequency of the median interval, and width is the interval width.

Mode
• Value that occurs most frequently in the data
• Unimodal, bimodal, trimodal
Empirical formula: \text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})
Variance
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2
Standard Deviation
Square root of variance, i.e., \sigma = \sqrt{\sigma^2}
Skewness
Skewness is a measure of symmetry (more precisely, the lack of symmetry):
\text{skew} = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{x_i - \bar{x}}{\sigma} \right]^3
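A small NumPy sketch of the population variance, standard deviation, and skewness as given above (NumPy assumed available; the sample values are made up):

    import numpy as np

    x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

    var = np.var(x)                              # population variance (divides by N)
    std = np.sqrt(var)                           # standard deviation
    skew = np.mean(((x - x.mean()) / std) ** 3)  # skewness per the formula above

    print(var, std, skew)                        # 4.0 2.0 0.65625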
Five-number summary: min, Q1, median, Q3, max

Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 − Q1
• Boxplot: ends of the box are the quartiles; the median is marked; add whiskers, and plot outliers individually
• Outlier: usually, a value more than 1.5 × IQR below Q1 or above Q3
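A short NumPy sketch of the five-number summary, the IQR, and the usual 1.5 × IQR outlier fences (illustrative data only):

    import numpy as np

    x = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 120])

    q1, med, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr        # outlier fences
    outliers = x[(x < lo) | (x > hi)]

    print(x.min(), q1, med, q3, x.max())           # five-number summary
    print(outliers)                                # values flagged as outliers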



Proximity Measure for Nominal Attributes
Method 1: Simple matching
m: number of matches, p: total number of attributes
d(i,j) = \frac{p - m}{p}
Method 2: Use a large number of binary attributes, creating a new binary attribute for each of the M nominal states.
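A minimal sketch of the simple-matching dissimilarity for two objects described by nominal attributes (the attribute values here are invented for illustration):

    def nominal_dissimilarity(a, b):
        # d(i, j) = (p - m) / p, where m = number of matching attributes
        p = len(a)
        m = sum(1 for x, y in zip(a, b) if x == y)
        return (p - m) / p

    print(nominal_dissimilarity(["red", "round", "small"],
                                ["red", "square", "small"]))  # (3 - 2) / 3 ≈ 0.33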
Proximity Measure for Ordinal Attributes
After assigning ranks, normalize them so that they fall in the range 0.0 to 1.0. Ranks can be mapped with the following formula:
z_{if} = \frac{r_{if} - 1}{m_f - 1}
where r_{if} is the rank of object i on attribute f and m_f is the total number of ordered states (categories) of attribute f.
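A small sketch that maps an ordered scale onto [0.0, 1.0] via z_if = (r_if − 1)/(m_f − 1); the scale and values are hypothetical:

    # Hypothetical ordered attribute with m_f = 4 states, lowest to highest.
    states = ["poor", "fair", "good", "excellent"]
    rank = {s: i + 1 for i, s in enumerate(states)}   # r_if in 1..m_f
    m_f = len(states)

    def ordinal_to_unit(value):
        return (rank[value] - 1) / (m_f - 1)

    print(ordinal_to_unit("poor"), ordinal_to_unit("good"), ordinal_to_unit("excellent"))
    # 0.0 0.666... 1.0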
Proximity Measure for Binary Attributes
\text{sim}(i,j) = 1 - d(i,j) = 1 - \frac{r + s}{q + r + s} = \frac{q}{q + r + s}
– q is the number of attributes that equal 1 for both objects i and j,
– r is the number of attributes that equal 1 for object i but 0 for object j,
– s is the number of attributes that equal 0 for object i but 1 for object j,
– t is the number of attributes that equal 0 for both objects i and j.
Jaccard Distance/Similarity
The Jaccard similarity of two sets is the size of their intersection divided by the size of their union:
sim(C_1, C_2) = |C_1 ∩ C_2| / |C_1 ∪ C_2|
Jaccard distance: d(C_1, C_2) = 1 − |C_1 ∩ C_2| / |C_1 ∪ C_2|
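A combined sketch of the binary-attribute similarity built from the q, r, s counts and the set-based Jaccard similarity:

    def binary_similarity(a, b):
        # a, b: equal-length 0/1 vectors; sim(i, j) = q / (q + r + s)
        q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
        r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
        s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
        return q / (q + r + s)

    def jaccard_similarity(c1, c2):
        c1, c2 = set(c1), set(c2)
        return len(c1 & c2) / len(c1 | c2)

    print(binary_similarity([1, 0, 1, 1], [1, 1, 0, 1]))  # 2 / 4 = 0.5
    print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))       # 2 / 4 = 0.5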


Proximity Measure for Numeric Attributes

Euclidean distance
The Euclidean distance between objects i and j is defined as
d(i,j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ip} - x_{jp})^2}

Manhattan distance (or city block)
The Manhattan distance between objects i and j is defined as
d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|



Minkowski Distance
Minkowski distance is a generalization of the Euclidean and Manhattan distances:
d(i,j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ip} - x_{jp}|^h \right)^{1/h}
Where
- i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are two objects described by p numeric attributes
- h is a real number such that h ≥ 1
- It is also called the L_h norm
- When h = 1, it represents the Manhattan distance (i.e., the L_1 norm)
- When h = 2, it represents the Euclidean distance (i.e., the L_2 norm)
Dissimilarity of Numeric Data: Supremum Distance
The supremum distance is a generalization of the Minkowski distance for h → ∞
- also referred to as L_max, the L_∞ norm, and the Chebyshev distance
- To compute it, we find the attribute f that gives the maximum difference in values between the two objects
- This difference is the supremum distance, defined more formally as:
d(i,j) = \lim_{h \to \infty} \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h} = \max_f |x_{if} - x_{jf}|
- The L_∞ norm is also known as the uniform norm
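A minimal sketch of the Minkowski family, covering the Manhattan (h = 1), Euclidean (h = 2), and supremum (h → ∞) cases above:

    import math

    def minkowski(x, y, h):
        if math.isinf(h):
            # supremum / Chebyshev distance: maximum attribute-wise difference
            return max(abs(a - b) for a, b in zip(x, y))
        return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

    i, j = (1, 2, 3), (4, 6, 3)
    print(minkowski(i, j, 1))         # Manhattan (L1): 3 + 4 + 0 = 7
    print(minkowski(i, j, 2))         # Euclidean (L2): sqrt(9 + 16 + 0) = 5.0
    print(minkowski(i, j, math.inf))  # supremum (Lmax): max(3, 4, 0) = 4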


Proximity Measure for Complex Objects
- Process all attribute types together, performing a single analysis
- Combine the different attributes into a single dissimilarity matrix, bringing all of the meaningful attributes onto a common scale of the interval [0.0, 1.0]
- Suppose that the data set contains p attributes of mixed type
- The dissimilarity d(i, j) between objects i and j is defined as:
d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}



Attributes of Mixed Type
d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
• f is binary or nominal:
  d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, or d_{ij}^{(f)} = 1 otherwise
• f is numeric: use the normalized distance
• f is ordinal: compute ranks r_{if} and z_{if} = \frac{r_{if} - 1}{M_f - 1}, and treat z_{if} as interval-scaled
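A sketch of the mixed-type dissimilarity for one hypothetical record layout (one nominal, one numeric, one ordinal attribute; all names, ranges, and levels are made up, and δ is taken as 1 whenever both values are present):

    def mixed_dissimilarity(i, j, kinds, ranges, levels):
        # kinds maps attribute -> 'nominal' | 'numeric' | 'ordinal'
        num = den = 0.0
        for f, kind in kinds.items():
            xi, xj = i.get(f), j.get(f)
            if xi is None or xj is None:            # delta_ij^(f) = 0 when a value is missing
                continue
            if kind == "nominal":
                d = 0.0 if xi == xj else 1.0
            elif kind == "numeric":
                d = abs(xi - xj) / ranges[f]        # normalized distance
            else:                                   # ordinal: z_if = (r_if - 1) / (M_f - 1)
                m = len(levels[f])
                d = abs(levels[f].index(xi) - levels[f].index(xj)) / (m - 1)
            num += d                                # delta_ij^(f) = 1 here
            den += 1.0
        return num / den

    kinds = {"color": "nominal", "weight": "numeric", "grade": "ordinal"}
    a = {"color": "red", "weight": 60.0, "grade": "good"}
    b = {"color": "blue", "weight": 80.0, "grade": "excellent"}
    print(mixed_dissimilarity(a, b, kinds,
                              ranges={"weight": 100.0},
                              levels={"grade": ["poor", "fair", "good", "excellent"]}))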

Cosine Similarity
Cosine measure: if d_1 and d_2 are two vectors, then
cos(d_1, d_2) = (d_1 • d_2) / (||d_1|| ||d_2||)
where • indicates the vector dot product and ||d|| is the length of vector d.
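A minimal cosine-similarity sketch for two plain Python vectors:

    import math

    def cosine_similarity(d1, d2):
        dot = sum(a * b for a, b in zip(d1, d2))
        norm1 = math.sqrt(sum(a * a for a in d1))
        norm2 = math.sqrt(sum(b * b for b in d2))
        return dot / (norm1 * norm2)

    print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # 1.0 (same direction)
    print(cosine_similarity([1, 0], [0, 1]))        # 0.0 (orthogonal)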
Discretization by Binning Methods
1) Equal Width (distance) binning: the value range is divided into intervals of equal width.
2) Equal Depth (frequency) binning: specify the number of values that have to be stored in each bin; the number of entries in each bin is equal; values that are equal may end up in different bins.
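An illustrative NumPy sketch of both binning schemes on a small sample (the bin count is chosen arbitrarily):

    import numpy as np

    x = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
    k = 3  # number of bins

    # Equal-width binning: every bin covers the same span of values.
    edges = np.linspace(x.min(), x.max(), k + 1)   # [4, 14, 24, 34]
    width_labels = np.digitize(x, edges[1:-1])     # bin index 0..k-1 for each value

    # Equal-depth binning: every bin holds (roughly) the same number of values.
    depth_bins = np.array_split(np.sort(x), k)

    print(width_labels)
    print([list(b) for b in depth_bins])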
Normalization

Min-Max normalization for range (0, 1):
\hat{x} = \frac{x - \min(x)}{\max(x) - \min(x)}

Min-Max normalization for range (new_{min}, new_{max}):
\hat{x} = \frac{x - \min(x)}{\max(x) - \min(x)} (new_{max} - new_{min}) + new_{min}

z-score normalization:
\hat{x} = \frac{x - \mu(x)}{\sigma(x)}

Normalization by decimal scaling:
\hat{x} = \frac{x}{10^j}
where j is the smallest integer such that \max(|\hat{x}|) < 1. The new range is [−1, +1].
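A NumPy sketch of the normalization schemes above (sample values invented):

    import numpy as np

    x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

    # Min-max to (0, 1), then to an arbitrary new range.
    minmax_01 = (x - x.min()) / (x.max() - x.min())
    new_min, new_max = 0.0, 10.0
    minmax_new = minmax_01 * (new_max - new_min) + new_min

    # z-score normalization (population standard deviation).
    zscore = (x - x.mean()) / x.std()

    # Decimal scaling: divide by 10^j for the smallest j with max(|x / 10^j|) < 1.
    j = 0
    while np.abs(x).max() / 10 ** j >= 1:
        j += 1
    decimal_scaled = x / 10 ** j

    print(minmax_01, minmax_new, zscore, decimal_scaled, sep="\n")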
Correlation coefficient
Range: −1 to +1
r_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n \sigma_x \sigma_y} = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{n \sigma_x \sigma_y}



Covariance
\text{Cov}(x,y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
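A NumPy sketch of the population covariance and Pearson correlation as defined above:

    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 8.0])
    y = np.array([1.0, 3.0, 7.0, 9.0])

    cov = np.mean((x - x.mean()) * (y - y.mean()))   # Cov(x, y), divided by n
    r = cov / (x.std() * y.std())                    # Pearson r, in [-1, +1]

    print(cov, r)
    print(np.corrcoef(x, y)[0, 1])                   # same r via NumPy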
χ² (chi-square) test
H0: there is no association between the two variables.
H1: there is a significant association between the two variables.
Chi-square test formula:
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
where O_i = observed values and E_i = expected values.
Expected value: E_i = (row total × column total) / grand total
If the calculated chi-square value is less than the chi-square critical value (table value), then we fail to reject ("accept") the null hypothesis.
Gini Index
Gini index for a given node t:
\text{Gini}(t) = 1 - \sum_j [p(j|t)]^2
Splitting based on GINI:
\text{GINI}_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \text{GINI}(i)
where n_i is the number of records at child i and n is the number of records at the parent node.
Entropy
Entropy at a given node t:
\text{Entropy}(t) = - \sum_j p(j|t) \log p(j|t)
Classification error at a node t:
\text{Error}(t) = 1 - \max_i P(i|t)
Gain Ratio
Expected information (entropy) needed to classify a tuple in D:
\text{Info}(D) = - \sum_{i=1}^{m} p_i \log_2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
\text{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \text{Info}(D_j)
Information gained by branching on attribute A:
\text{Gain}(A) = \text{Info}(D) - \text{Info}_A(D)
The gain ratio normalizes the gain by the split information:
\text{SplitInfo}_A(D) = - \sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right), \quad \text{GainRatio}(A) = \frac{\text{Gain}(A)}{\text{SplitInfo}_A(D)}
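A compact sketch computing Gini, entropy, information gain, and gain ratio for one split, given class-count lists per node (the counts are illustrative):

    import math

    def gini(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

    parent = [6, 6]                      # class counts at the parent node
    children = [[5, 1], [1, 5]]          # class counts after splitting on attribute A
    n = sum(parent)

    gini_split = sum(sum(ch) / n * gini(ch) for ch in children)
    info_a = sum(sum(ch) / n * entropy(ch) for ch in children)
    gain = entropy(parent) - info_a
    split_info = -sum(sum(ch) / n * math.log2(sum(ch) / n) for ch in children)

    print(gini(parent), gini_split)      # 0.5 vs ~0.278
    print(gain, gain / split_info)       # Gain(A) and GainRatio(A)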

