Compared to the Euclidean distance, the effect of an outlier is dampened because the component differences are not squared. The Manhattan metric is widely used in a variety of data mining algorithms and is justified in those problem domains where two data instances should be considered the same distance apart when they differ by, say, a total of 4 units spread across two dimensions as when they differ by 1 unit along one dimension and 3 along another.
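The two properties above can be checked directly. The following is a minimal pure-Python sketch (the function names are ours, not from the text): the Manhattan distance treats any split of the same total component difference as equidistant, while the Euclidean distance is dominated by a single outlying component.

```python
import math

def euclidean(x, y):
    """Euclidean distance: sqrt of the sum of squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Manhattan (city-block) distance: sum of absolute component differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Equidistance under the Manhattan metric: the total difference is 4 in both cases.
print(manhattan((0, 0), (1, 3)))   # 4
print(manhattan((0, 0), (2, 2)))   # 4

# Outlier behaviour: the third component difference (10) dominates
# the Euclidean distance but contributes only linearly to the Manhattan one.
print(euclidean((0, 0, 0), (1, 1, 10)))   # sqrt(102) ~ 10.1
print(manhattan((0, 0, 0), (1, 1, 10)))   # 12
```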
The power distance is a generalization of the Minkowski metric discussed earlier:

$$d(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/r}$$

where the points take the form $\mathbf{x} = (x_1, \ldots, x_n)$ and $\mathbf{y} = (y_1, \ldots, y_n)$. The parameter $p$ controls the weight given to dominance by larger differences along each dimension. The parameter $r$ determines the progressive weight given to larger distances between objects. When $p = r$, the distance measure is the Minkowski metric:

$$d(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
of which the Euclidean metric is a further special case for $p = r = 2$, and the Manhattan metric for $p = r = 1$.
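The family of metrics above can be captured in a single function. This is an illustrative sketch (the function name is ours): setting $p = r$ recovers the Minkowski metric, with the Euclidean and Manhattan metrics as the special cases $p = r = 2$ and $p = r = 1$.

```python
def power_distance(x, y, p, r):
    """Power distance: (sum_i |x_i - y_i|**p) ** (1/r).

    With p == r this is the Minkowski metric; p = r = 2 gives the
    Euclidean metric and p = r = 1 gives the Manhattan metric.
    """
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / r)

# Euclidean special case (p = r = 2): a 3-4-5 right triangle.
print(power_distance((0, 0), (3, 4), 2, 2))   # 5.0

# Manhattan special case (p = r = 1): sum of absolute differences.
print(power_distance((0, 0), (3, 4), 1, 1))   # 7.0
```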
Another distance measure, often used for text document information retrieval, is the cosine angle distance. This is the cosine of the angle subtended by the two points from the origin, subtracted from unity:

$$d(\mathbf{x}, \mathbf{y}) = 1 - \cos\theta = 1 - \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}$$
The dot product form is particularly simple to compute. This measure is not a metric, as it does not satisfy the triangle inequality. In addition, it does not reflect differences in magnitude amongst the components of $\mathbf{x}$ and $\mathbf{y}$, and so is useful only in specific problem domains. Kormaz proposes a hybrid measure which aims to overcome this last limitation.
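The magnitude-insensitivity noted above is easy to demonstrate. A minimal sketch (the function name is ours): scaling one vector by any positive constant leaves the cosine angle distance unchanged, since only the angle between the vectors matters.

```python
import math

def cosine_distance(x, y):
    """Cosine angle distance: 1 - (x . y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

# Collinear vectors of different magnitudes are at distance 0:
print(cosine_distance((1, 2), (2, 4)))   # 0.0 (angle is zero)

# Orthogonal vectors are at the maximum distance for non-negative data:
print(cosine_distance((1, 0), (0, 1)))   # 1.0
```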
Figure illustrates the differences between the Euclidean, Manhattan and cosine angle distances.
[Figure: The Euclidean, Manhattan and Cosine Distances Compared.]
Mahalanobis Measure
The Mahalanobis distance is a measure between a data instance and the centroid of a set of instances:

$$d(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} S^{-1} (\mathbf{x} - \boldsymbol{\mu})}$$

where $\mathbf{x}$ is the data instance, $\boldsymbol{\mu}$ is the centroid of a set of data instances, and $S$ is the matrix of covariances for the data set or cluster. By virtue of the covariance matrix, the surface of constant distance is an ellipsoid, or a hyper-ellipsoid for higher dimensional data, to account for differences in the variation of data along each dimension.
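For concreteness, the formula can be evaluated directly in two dimensions. This is an illustrative pure-Python sketch (the function name and the diagonal example covariance are ours): with a covariance matrix whose variance along the first axis is 4 and along the second is 1, a point 2 units away along the high-variance axis is at the same Mahalanobis distance as a point only 1 unit away along the low-variance axis.

```python
import math

def mahalanobis_2d(x, mu, S):
    """Mahalanobis distance sqrt((x - mu)^T S^{-1} (x - mu)) for 2-D data.

    S is a 2x2 covariance matrix given as ((a, b), (c, e)); its inverse
    is computed in closed form via the determinant.
    """
    d = (x[0] - mu[0], x[1] - mu[1])
    (a, b), (c, e) = S
    det = a * e - b * c
    inv = ((e / det, -b / det), (-c / det, a / det))   # inverse of S
    # Quadratic form d^T * inv * d
    q = (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
         + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))
    return math.sqrt(q)

S = ((4.0, 0.0), (0.0, 1.0))   # variance 4 along axis 1, variance 1 along axis 2
mu = (0.0, 0.0)
print(mahalanobis_2d((2.0, 0.0), mu, S))   # 1.0 -- 2 units in the "wide" direction
print(mahalanobis_2d((0.0, 2.0), mu, S))   # 2.0 -- 2 units in the "narrow" direction
```

The contours of constant distance here are ellipses twice as wide along the first axis as the second, matching the ellipsoidal surfaces described above.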