Set and Sequence Distances metrics

1. Jaccard Distance

Jaccard Similarity is a measure of similarity between two asymmetric binary vectors or between two sets.

It is denoted by J and it is also known as Jaccard Index, Jaccard Coefficient, Jaccard Dissimilarity, and Jaccard Distance.

Formula:

It calculates the size of the intersection between two sets divided by the size of their union.

1. Jaccard Similarity
J(A,B)=(number of observations in both sets)(number in either set)=|AB||AB|

Range: The index ranges from 0 to 1. Range closer to 1 means more similarity in two sets of data.

2. Jaccard Distance

Jaccard distance = 1 - Jaccard Similarity

JD(A,B)=1|AB||AB|
Example

Suppose we have two sets:

Application

Jaccard Distance is primarily used in clustering algorithms and anomaly detection systems that rely on distance matrices.

Advantages

2. Hamming Distance

The Hamming Distance simply counts the number of positions at which the corresponding symbols are different.

One-Hot Encoding and Hamming Distance

One-Hot Encoding is a technique used to convert categorical variables into a binary format (0s and 1s) so machine learning algorithms can process them. Each unique category gets its own column, and for any given row, exactly one column is "hot" (set to 1) while all others are "cold" (set to 0).

The Scenario: Fruit Classification

Imagine we have a dataset with a Fruit feature that can be Apple, Banana, or Cherry. When we apply One-Hot Encoding, the data transforms like this:

Applying Hamming Distance

Because One-Hot Encoding forces every vector to have identical lengths, it is perfectly suited for Hamming Distance. Calculating "Hamming Distance" between Apple and Banana:

  1. Position 1: 1 vs 0 (Different 1)
  2. Position 2: 0 vs 1 (Different 1)
  3. Position 3: 0 vs 0 (Same 0)

Application

3. Levenshtein

Levenshtein distance is used to measure the difference between two sequences/vectors, and it calculates the minimum number of insertions, deletions, or substitutions required to transform one sequence into another.

As this sounds a bit similar to Hamming Distance, lets see the difference with example

Hamming vs Levenshtein Distance

Feature Hamming Distance Levenshtein Distance
Core Definition Measures the number of positions where two strings of equal length differ. Measures the minimum number of single-character edits needed to change one string into another.
Allowed Operations Substitution only. Insertion, Deletion, and Substitution.
String Length Strings must be the exact same length. (Fails otherwise). Strings can be of varying lengths.
Handling of "Shifts" Highly sensitive. If you insert a letter at the start of a word, it misaligns the rest of the string, resulting in a massive error score. Highly robust. It recognizes shifted characters and handles them via a single insertion or deletion.
Algorithmic Complexity O(n) - Very fast. It just scans the strings sequentially position-by-position. O(n×m) - Computationally heavier. Requires building a dynamic programming matrix to find the optimal path of edits.
Best Use Cases Error detection in fixed-length network packets, cryptography, binary data transmission, SNP analysis in genetics. Spell checkers, fuzzy searching (e.g., matching "Jon" to "John"), deduplication of dirty database records, NLP.

Examples

Example Levenshtein distance Hamming Distance
A="GHOST"
B="HOSTS"
Hamming Distance: 5
Hamming compares the words strictly by their position:
1. G vs H (Different)
2. H vs O (Different)
3. O vs S (Different)
4. S vs T (Different)
5. T vs S (Different)
Levenshtein Distance: 2
Levenshtein looks at the overall sequence and finds the most efficient way to edit one into the other:
1. Delete the "G" at the beginning (Leaves "HOST").
2. Insert an "S" at the end (Creates "HOSTS").
It correctly recognizes that the core of the word ("HOST") is completely intact.
A="DATA"
B="DATABASE"
Undefined (Error)
Hamming distance requires both strings to be the exact same length.
Levenshtein Distance: 4 Levenshtein handles different lengths effortlessly by using insertions or deletions.
1. Insert "B"
2. Insert "A"
3. Insert "S"
4. Insert "E"
A="READ"
B="ROAD"
Hamming Distance: 1 Levenshtein Distance: 1
Useage When you are comparing fixed-length codes (like binary sequences or fixed-length genetic strings) where position and length matters absolutely Natural human text, typos, and strings of varying lengths where characters might have been added or missed