LU1 - Distance Based Models
Hamming distance
• The Hamming distance between two binary strings x and y of equal length is the
number of bits that need to be flipped to change x into y; for non-binary strings of
unequal length this can be generalised to the notion of edit distance, or Levenshtein distance.
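A minimal sketch of both distances, assuming strings as input. The function names are illustrative, and the Levenshtein version uses the standard row-by-row dynamic programme with unit cost for insertion, deletion and substitution.

```python
def hamming_distance(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def levenshtein_distance(x: str, y: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn x into y."""
    previous = list(range(len(y) + 1))  # distances from "" to each prefix of y
    for i, a in enumerate(x, start=1):
        current = [i]  # distance from x[:i] to the empty string
        for j, b in enumerate(y, start=1):
            current.append(min(
                previous[j] + 1,             # delete a
                current[j - 1] + 1,          # insert b
                previous[j - 1] + (a != b),  # substitute (free if a == b)
            ))
        previous = current
    return previous[-1]


print(hamming_distance("10110", "10011"))         # two bits differ
print(levenshtein_distance("kitten", "sitting"))  # classic example, 3 edits
```

For equal-length strings the Levenshtein distance is never larger than the Hamming distance, because flipping each differing position is one valid edit sequence.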
• Definition 8.2 (Distance metric). Given an instance space X, a distance
metric is a function Dis : X × X → ℝ such that for any x, y, z ∈ X:
• 1. distances between a point and itself are zero: Dis(x,x) = 0;
• 2. all other distances are larger than zero: if x ≠ y then Dis(x, y) > 0;
• 3. distances are symmetric: Dis(y,x) = Dis(x, y);
• 4. detours cannot shorten the distance: Dis(x, z) ≤ Dis(x, y) + Dis(y, z).
• If the second condition is weakened to a non-strict inequality (i.e., Dis(x, y)
may be zero even if x ≠ y), the function Dis is called a pseudo-metric.
• The last condition is called the triangle inequality.
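The four conditions can be spot-checked numerically. The sketch below is illustrative only: the helper name, the sample points and the two counter-example functions are assumptions made for this example. Passing on a finite sample does not prove that a function is a metric, but a single failure disproves it.

```python
import math
from itertools import product


def violated_conditions(dis, points, tol=1e-12):
    """Return the numbers (1-4) of the metric conditions that fail on `points`."""
    violated = set()
    for x, y, z in product(points, repeat=3):
        if abs(dis(x, x)) > tol:
            violated.add(1)  # a point must be at distance zero from itself
        if x != y and dis(x, y) <= tol:
            violated.add(2)  # distinct points must be a positive distance apart
        if abs(dis(x, y) - dis(y, x)) > tol:
            violated.add(3)  # symmetry
        if dis(x, z) > dis(x, y) + dis(y, z) + tol:
            violated.add(4)  # triangle inequality
    return violated


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def first_coordinate_gap(x, y):
    # Ignores every coordinate but the first, so distinct points can be 0 apart.
    return abs(x[0] - y[0])


POINTS = [(0, 0), (1, 0), (2, 0), (0, 1)]  # hypothetical sample points

print(violated_conditions(math.dist, POINTS))             # Euclidean distance
print(violated_conditions(squared_euclidean, POINTS))     # fails condition 4
print(violated_conditions(first_coordinate_gap, POINTS))  # fails condition 2
```

The last function fails only condition 2 on this sample, which is the pattern expected of a pseudo-metric.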
• The green circle connects points that are the same Euclidean distance (i.e., Minkowski
distance of order p = 2) away from the origin as A.
• The orange circle shows that B and C are equidistant from A.
• The red circle demonstrates that C is closer to the origin than B, which conforms
to the triangle inequality.
• The triangle inequality dictates that the distance from the origin to C is no more than the sum of the
distances from the origin to A (Dis(O,A)) and from A to C (Dis(A, C)).
• B is at the same distance from A as C, regardless of the distance measure used; so Dis(O,A)+Dis(A,C)
is equal to the distance from the origin to B.
• So, if we draw a circle around the origin through B, the triangle inequality dictates that C cannot
lie outside that circle.
• B is the only point where the circles around the origin and around A intersect, so everywhere else
the triangle inequality is a strict inequality.
• With Manhattan distance (p = 1), B and C are equally close to the origin and also
equidistant from A.
• Because B and C are now equidistant from the origin, travelling via A to C is no
longer a detour, but just one of the many shortest routes.
• However, if we now decrease p further, we see that C ends up outside the red shape,
and is thus further away than B when seen from the origin, whereas of course the sum
of the distances from the origin to A and from A to C is still equal to the distance
from the origin to B. At this point, our intuition breaks down:
• Minkowski distances with p < 1 are simply not very useful as distances since they all
violate the triangle inequality.
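The effect of p can be checked with a small sketch. The coordinates below are made up for illustration and are not the points A, B and C of the figure; `minkowski_distance` is a hypothetical helper implementing (Σ |xᵢ − yᵢ|ᵖ)^(1/p).

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


# Hypothetical points: going from the origin to c via a is a "detour".
origin, a, c = (0.0, 0.0), (0.0, 1.0), (1.0, 1.0)

for p in (2, 1, 0.5):
    direct = minkowski_distance(origin, c, p)
    detour = minkowski_distance(origin, a, p) + minkowski_distance(a, c, p)
    # The triangle inequality requires direct <= detour.
    print(f"p={p}: direct={direct:.3f} detour={detour:.3f} holds={direct <= detour}")
```

With these points the direct route is strictly shorter for p = 2, ties with the detour for p = 1, and is longer than the detour for p = 0.5, which is the violation described above.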
Elliptical Distances
• It is often more realistic to use an ellipse rather than a circle to identify points that can be
reached in a fixed amount of time, with the major axis of the ellipse indicating
directions that can be traversed at larger speed.
• Mathematically, while hyper-spheres (circles in d ≥ 2 dimensions) of radius r can be
defined by the equation xᵀx = r², hyper-ellipses are defined by xᵀMx = r²,
• where M is a matrix describing the appropriate rotation and scaling.
• The shape of the ellipse is estimated from data as the inverse of the covariance matrix:
M = Σ⁻¹.
• This leads to the definition of the Mahalanobis distance.
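With M = Σ⁻¹, the Mahalanobis distance between x and y is √((x − y)ᵀ Σ⁻¹ (x − y)). The sketch below is restricted to two dimensions so that it needs only the standard library; the helper names and the example covariance matrix are assumptions made for illustration.

```python
import math


def covariance_2d(points):
    """Sample covariance matrix ((sxx, sxy), (sxy, syy)) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return ((sxx, sxy), (sxy, syy))


def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested tuples."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return ((d / det, -b / det), (-c / det, a / det))


def mahalanobis_distance(x, y, cov):
    """sqrt((x - y)^T M (x - y)) with M the inverse of `cov`."""
    m = inverse_2x2(cov)
    dx, dy = x[0] - y[0], x[1] - y[1]
    q = dx * (m[0][0] * dx + m[0][1] * dy) + dy * (m[1][0] * dx + m[1][1] * dy)
    return math.sqrt(q)


# Hypothetical covariance: four times more variance along the first axis.
cov = ((4.0, 0.0), (0.0, 1.0))
print(mahalanobis_distance((2.0, 0.0), (0.0, 0.0), cov))
print(mahalanobis_distance((0.0, 1.0), (0.0, 0.0), cov))
```

Both printed distances are equal, so the two points lie on the same ellipse around the origin even though their Euclidean distances differ. With the identity matrix as covariance, the function reduces to the Euclidean distance.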