
Formulas at a Glance: IDS

Arithmetic Mean
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
where n is the number of observations.

Weighted arithmetic mean
\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
where w_i is the weight of x_i.
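As a quick illustration, a minimal Python sketch of both means (plain lists assumed, no particular library):

    # Plain and weighted arithmetic mean (minimal sketch).
    def mean(xs):
        return sum(xs) / len(xs)

    def weighted_mean(xs, ws):
        # sum(w_i * x_i) / sum(w_i)
        return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

    print(mean([2, 4, 6]))                      # 4.0
    print(weighted_mean([2, 4, 6], [1, 1, 2]))  # (2 + 4 + 12) / 4 = 4.5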

Median
Middle value if odd number of values, or average of the middle two values otherwise.
Estimated by interpolation (for grouped data):
\text{median} = L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}
where L_1 is the lower boundary of the median interval, (\sum \text{freq})_l is the sum of the frequencies of the intervals below it, \text{freq}_{\text{median}} is the frequency of the median interval, and width is the interval width.

Mode
• Value that occurs most frequently in the data
• Unimodal, bimodal, trimodal
Empirical formula: \text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})
Variance
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2
Standard Deviation
Square root of variance, i.e., \sigma = \sqrt{\sigma^2}
Skewness
Skewness is a measure of symmetry (more precisely, the lack of symmetry):
\text{skew} = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{x_i - \bar{x}}{\sigma} \right]^3
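A small NumPy sketch of the population variance, standard deviation, and skewness as given above (NumPy assumed available; the sample values are made up):

    import numpy as np

    x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

    var = np.var(x)                              # population variance (divides by N)
    std = np.sqrt(var)                           # standard deviation
    skew = np.mean(((x - x.mean()) / std) ** 3)  # skewness per the formula above

    print(var, std, skew)                        # 4.0 2.0 0.65625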
Five-number summary: min, Q1, median, Q3, max

Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 − Q1
• Boxplot: ends of the box are the quartiles; the median is marked; add whiskers, and plot outliers individually
• Outlier: usually, a value more than 1.5 × IQR below Q1 or above Q3
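A short NumPy sketch of the five-number summary, the IQR, and the usual 1.5 × IQR outlier fences (illustrative data only):

    import numpy as np

    x = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 120])

    q1, med, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr        # outlier fences
    outliers = x[(x < lo) | (x > hi)]

    print(x.min(), q1, med, q3, x.max())           # five-number summary
    print(outliers)                                # values flagged as outliers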



Proximity Measure for Nominal Attributes
Method 1: Simple matching
m: number of matches, p: total number of attributes
d(i,j) = \frac{p - m}{p}
Method 2: Use a large number of binary attributes, creating a new binary attribute for each of the M nominal states.
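A minimal sketch of the simple-matching dissimilarity for two objects described by nominal attributes (the attribute values here are invented for illustration):

    def nominal_dissimilarity(a, b):
        # d(i, j) = (p - m) / p, where m = number of matching attributes
        p = len(a)
        m = sum(1 for x, y in zip(a, b) if x == y)
        return (p - m) / p

    print(nominal_dissimilarity(["red", "round", "small"],
                                ["red", "square", "small"]))  # (3 - 2) / 3 ≈ 0.33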
Proximity Measure for Ordinal Attributes
After assigning ranks, normalize them so that they fall in the range 0.0 to 1.0. Ranks can be mapped with the following formula:
z_{if} = \frac{r_{if} - 1}{m_f - 1}
where r_{if} is the rank of object i on attribute f and m_f is the total number of ordered states (categories) of attribute f.
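A small sketch that maps an ordered scale onto [0.0, 1.0] via z_if = (r_if − 1)/(m_f − 1); the scale and values are hypothetical:

    # Hypothetical ordered attribute with m_f = 4 states, lowest to highest.
    states = ["poor", "fair", "good", "excellent"]
    rank = {s: i + 1 for i, s in enumerate(states)}   # r_if in 1..m_f
    m_f = len(states)

    def ordinal_to_unit(value):
        return (rank[value] - 1) / (m_f - 1)

    print(ordinal_to_unit("poor"), ordinal_to_unit("good"), ordinal_to_unit("excellent"))
    # 0.0 0.666... 1.0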
Proximity Measure for Binary Attributes
\text{sim}(i,j) = 1 - d(i,j) = 1 - \frac{r + s}{q + r + s} = \frac{q}{q + r + s}
– q is the number of attributes that equal 1 for both objects i and j,
– r is the number of attributes that equal 1 for object i but 0 for object j,
– s is the number of attributes that equal 0 for object i but 1 for object j,
– t is the number of attributes that equal 0 for both objects i and j.
Jaccard Distance/Similarity
The Jaccard similarity of two sets is the size of their intersection divided by the size of their union:
sim(C_1, C_2) = |C_1 ∩ C_2| / |C_1 ∪ C_2|
Jaccard distance: d(C_1, C_2) = 1 − |C_1 ∩ C_2| / |C_1 ∪ C_2|
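A combined sketch of the binary-attribute similarity built from the q, r, s counts and the set-based Jaccard similarity:

    def binary_similarity(a, b):
        # a, b: equal-length 0/1 vectors; sim(i, j) = q / (q + r + s)
        q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
        r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
        s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
        return q / (q + r + s)

    def jaccard_similarity(c1, c2):
        c1, c2 = set(c1), set(c2)
        return len(c1 & c2) / len(c1 | c2)

    print(binary_similarity([1, 0, 1, 1], [1, 1, 0, 1]))  # 2 / 4 = 0.5
    print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))       # 2 / 4 = 0.5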


Proximity Measure for Numeric Attributes

Euclidean distance
The Euclidean distance between objects i and j is defined as
d(i,j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ip} - x_{jp})^2}

Manhattan distance (or city block)
The Manhattan distance between objects i and j is defined as
d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|



Minkowski Distance
Minkowski distance is a generalization of the Euclidean and Manhattan distances:
d(i,j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ip} - x_{jp}|^h \right)^{1/h}
Where
- i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are two objects described by p numeric attributes
- h is a real number such that h ≥ 1
- It is also called the L_h norm
- When h = 1, it represents the Manhattan distance (i.e., the L_1 norm)
- When h = 2, it represents the Euclidean distance (i.e., the L_2 norm)
Dissimilarity of Numeric Data: Supremum Distance
The supremum distance is a generalization of the Minkowski distance for h → ∞
- also referred to as L_max, the L_∞ norm, and the Chebyshev distance
- To compute it, we find the attribute f that gives the maximum difference in values between the two objects
- This difference is the supremum distance, defined more formally as:
d(i,j) = \lim_{h \to \infty} \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h} = \max_f |x_{if} - x_{jf}|
- The L_∞ norm is also known as the uniform norm
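A minimal sketch of the Minkowski family, covering the Manhattan (h = 1), Euclidean (h = 2), and supremum (h → ∞) cases above:

    import math

    def minkowski(x, y, h):
        if math.isinf(h):
            # supremum / Chebyshev distance: maximum attribute-wise difference
            return max(abs(a - b) for a, b in zip(x, y))
        return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

    i, j = (1, 2, 3), (4, 6, 3)
    print(minkowski(i, j, 1))         # Manhattan (L1): 3 + 4 + 0 = 7
    print(minkowski(i, j, 2))         # Euclidean (L2): sqrt(9 + 16 + 0) = 5.0
    print(minkowski(i, j, math.inf))  # supremum (Lmax): max(3, 4, 0) = 4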


Proximity Measure for Complex Objects
- Process all attribute types together, performing a single analysis
- Combine the different attributes into a single dissimilarity matrix, bringing all of the meaningful attributes onto a common scale of the interval [0.0, 1.0]
- Suppose that the data set contains p attributes of mixed type
- The dissimilarity d(i, j) between objects i and j is defined as:
d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}



Attributes of Mixed Type
d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
• f is binary or nominal:
  d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, or d_{ij}^{(f)} = 1 otherwise
• f is numeric: use the normalized distance
• f is ordinal: compute ranks r_{if} and z_{if} = \frac{r_{if} - 1}{M_f - 1}, and treat z_{if} as interval-scaled
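A sketch of the mixed-type dissimilarity for one hypothetical record layout (one nominal, one numeric, one ordinal attribute; all names, ranges, and levels are made up, and δ is taken as 1 whenever both values are present):

    def mixed_dissimilarity(i, j, kinds, ranges, levels):
        # kinds maps attribute -> 'nominal' | 'numeric' | 'ordinal'
        num = den = 0.0
        for f, kind in kinds.items():
            xi, xj = i.get(f), j.get(f)
            if xi is None or xj is None:            # delta_ij^(f) = 0 when a value is missing
                continue
            if kind == "nominal":
                d = 0.0 if xi == xj else 1.0
            elif kind == "numeric":
                d = abs(xi - xj) / ranges[f]        # normalized distance
            else:                                   # ordinal: z_if = (r_if - 1) / (M_f - 1)
                m = len(levels[f])
                d = abs(levels[f].index(xi) - levels[f].index(xj)) / (m - 1)
            num += d                                # delta_ij^(f) = 1 here
            den += 1.0
        return num / den

    kinds = {"color": "nominal", "weight": "numeric", "grade": "ordinal"}
    a = {"color": "red", "weight": 60.0, "grade": "good"}
    b = {"color": "blue", "weight": 80.0, "grade": "excellent"}
    print(mixed_dissimilarity(a, b, kinds,
                              ranges={"weight": 100.0},
                              levels={"grade": ["poor", "fair", "good", "excellent"]}))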

Cosine Similarity
Cosine measure: if d_1 and d_2 are two vectors, then
cos(d_1, d_2) = (d_1 • d_2) / (||d_1|| ||d_2||)
where • indicates the vector dot product and ||d|| is the length of vector d.
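A minimal cosine-similarity sketch for two plain Python vectors:

    import math

    def cosine_similarity(d1, d2):
        dot = sum(a * b for a, b in zip(d1, d2))
        norm1 = math.sqrt(sum(a * a for a in d1))
        norm2 = math.sqrt(sum(b * b for b in d2))
        return dot / (norm1 * norm2)

    print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # 1.0 (same direction)
    print(cosine_similarity([1, 0], [0, 1]))        # 0.0 (orthogonal)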
Discretization by Binning Methods
1) Equal Width (distance) binning: the value range is divided into intervals of equal width.
2) Equal Depth (frequency) binning: specify the number of values that have to be stored in each bin; the number of entries in each bin is equal; values that are equal may end up in different bins.
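An illustrative NumPy sketch of both binning schemes on a small sample (the bin count is chosen arbitrarily):

    import numpy as np

    x = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
    k = 3  # number of bins

    # Equal-width binning: every bin covers the same span of values.
    edges = np.linspace(x.min(), x.max(), k + 1)   # [4, 14, 24, 34]
    width_labels = np.digitize(x, edges[1:-1])     # bin index 0..k-1 for each value

    # Equal-depth binning: every bin holds (roughly) the same number of values.
    depth_bins = np.array_split(np.sort(x), k)

    print(width_labels)
    print([list(b) for b in depth_bins])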
Normalization

Min-Max normalization for range (0, 1):
\hat{x} = \frac{x - \min(x)}{\max(x) - \min(x)}

Min-Max normalization for range (new_{min}, new_{max}):
\hat{x} = \frac{x - \min(x)}{\max(x) - \min(x)} (new_{max} - new_{min}) + new_{min}

z-score normalization:
\hat{x} = \frac{x - \mu(x)}{\sigma(x)}

Normalization by decimal scaling:
\hat{x} = \frac{x}{10^j}
where j is the smallest integer such that \max(|\hat{x}|) < 1. The new range is [−1, +1].
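A NumPy sketch of the normalization schemes above (sample values invented):

    import numpy as np

    x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

    # Min-max to (0, 1), then to an arbitrary new range.
    minmax_01 = (x - x.min()) / (x.max() - x.min())
    new_min, new_max = 0.0, 10.0
    minmax_new = minmax_01 * (new_max - new_min) + new_min

    # z-score normalization (population standard deviation).
    zscore = (x - x.mean()) / x.std()

    # Decimal scaling: divide by 10^j for the smallest j with max(|x / 10^j|) < 1.
    j = 0
    while np.abs(x).max() / 10 ** j >= 1:
        j += 1
    decimal_scaled = x / 10 ** j

    print(minmax_01, minmax_new, zscore, decimal_scaled, sep="\n")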
Correlation coefficient
Range: −1 to +1
r_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n \sigma_x \sigma_y} = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{n \sigma_x \sigma_y}



Covariance
\text{Cov}(x,y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
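A NumPy sketch of the population covariance and Pearson correlation as defined above:

    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 8.0])
    y = np.array([1.0, 3.0, 7.0, 9.0])

    cov = np.mean((x - x.mean()) * (y - y.mean()))   # Cov(x, y), divided by n
    r = cov / (x.std() * y.std())                    # Pearson r, in [-1, +1]

    print(cov, r)
    print(np.corrcoef(x, y)[0, 1])                   # same r via NumPy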
χ² (chi-square) test
H0: there is no association between the two variables.
H1: there is a significant association between the two variables.
Chi-square test formula:
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
where O_i = observed values and E_i = expected values.
Expected value: E_i = (row total × column total) / grand total
If the calculated chi-square value is less than the chi-square critical value (table value), then we fail to reject ("accept") the null hypothesis.
Gini Index
Gini index for a given node t:
\text{Gini}(t) = 1 - \sum_j [p(j|t)]^2
Splitting based on GINI:
\text{GINI}_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \text{GINI}(i)
where n_i is the number of records at child i and n is the number of records at the parent node.
Entropy
Entropy at a given node t:
\text{Entropy}(t) = - \sum_j p(j|t) \log p(j|t)
Classification error at a node t:
\text{Error}(t) = 1 - \max_i P(i|t)
Gain Ratio
Expected information (entropy) needed to classify a tuple in D:
\text{Info}(D) = - \sum_{i=1}^{m} p_i \log_2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
\text{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \text{Info}(D_j)
Information gained by branching on attribute A:
\text{Gain}(A) = \text{Info}(D) - \text{Info}_A(D)
The gain ratio normalizes the gain by the split information:
\text{SplitInfo}_A(D) = - \sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right), \quad \text{GainRatio}(A) = \frac{\text{Gain}(A)}{\text{SplitInfo}_A(D)}
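A compact sketch computing Gini, entropy, information gain, and gain ratio for one split, given class-count lists per node (the counts are illustrative):

    import math

    def gini(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

    parent = [6, 6]                      # class counts at the parent node
    children = [[5, 1], [1, 5]]          # class counts after splitting on attribute A
    n = sum(parent)

    gini_split = sum(sum(ch) / n * gini(ch) for ch in children)
    info_a = sum(sum(ch) / n * entropy(ch) for ch in children)
    gain = entropy(parent) - info_a
    split_info = -sum(sum(ch) / n * math.log2(sum(ch) / n) for ch in children)

    print(gini(parent), gini_split)      # 0.5 vs ~0.278
    print(gain, gain / split_info)       # Gain(A) and GainRatio(A)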

