Set and Sequence Distances metrics

1. Jaccard Distance

Jaccard Similarity is a measure of similarity between two asymmetric binary vectors or between two sets.

It is denoted by $J$ and it is also known as Jaccard Index, Jaccard Coefficient, Jaccard Dissimilarity, and Jaccard Distance.

Formula:

It calculates the size of the intersection between two sets divided by the size of their union.

1. Jaccard Similarity

J (A, B) = \frac{(number of observations in both sets)}{(number in either set)} = \frac{| A \cap B |}{| A \cup B |}

Range: The index ranges from 0 to 1. Range closer to 1 means more similarity in two sets of data.

2. Jaccard Distance

$Jaccard distance = 1 - Jaccard Similarity$

J_{D} (A, B) = 1 - \frac{| A \cap B |}{| A \cup B |}

Example

Suppose we have two sets:

Set A =
Set B =
$A \cap B = {A, C, E, G} ∴ C o u n t = 4$
$A \cup B = {A, B, C, D, E, F, G, H, I} ∴ C o u n t = 9$
$J (A, B) = \frac{| A \cap B |}{| A \cup B |} = \frac{4}{9} = 0.44$
$J_{D} (A, B) = 1 - J (A, B) = 0.56$

Application

Jaccard Distance is primarily used in clustering algorithms and anomaly detection systems that rely on distance matrices.

Hierarchical Clustering: To group data points based on dissimilarity, placing points with fewer common elements farther apart.
Density-Based Clustering (e.g., DBSCAN): As a valid metric satisfying the triangle inequality, making it suitable for defining neighborhood distances in sparse, high-dimensional binary data.
Anomaly Detection: To identify outliers by finding data points with high dissimilarity (distance) from the majority of other points in the dataset.

Advantages

It is simple to calculate.
It is not affected by the size of dataset.
It is a reliable option for determining similarity when the sets are of various sizes.
It may be used to compare the similarity of any kind of data. (e.g time series data, photos, text, and images)

2. Hamming Distance

The Hamming Distance simply counts the number of positions at which the corresponding symbols are different.

One-Hot Encoding and Hamming Distance

One-Hot Encoding is a technique used to convert categorical variables into a binary format (0s and 1s) so machine learning algorithms can process them. Each unique category gets its own column, and for any given row, exactly one column is "hot" (set to 1) while all others are "cold" (set to 0).

The Scenario: Fruit Classification

Imagine we have a dataset with a Fruit feature that can be Apple, Banana, or Cherry. When we apply One-Hot Encoding, the data transforms like this:

$Apple = [1, 0, 0]$
$Banana = [0, 1, 0]$
$Cherry = [0, 0, 1]$

Applying Hamming Distance

Because One-Hot Encoding forces every vector to have identical lengths, it is perfectly suited for Hamming Distance. Calculating "Hamming Distance" between Apple and Banana:

Position 1: 1 vs 0 (Different $\to$ 1)
Position 2: 0 vs 1 (Different $\to$ 1)
Position 3: 0 vs 0 (Same $\to$ 0)

The Hamming Distance is 2.

Application

Machine Learning
- Modes Clustering: A variation of the popular K-means algorithm designed specifically for categorical data. It uses Hamming distance to group similar data points together.
Other applications
- Checksum and Parity Check Algorithms
- Recommendation Systems:
- Error detection,
- Hashing and Fuzzy Joins
- Bioinformatics

3. Levenshtein

Levenshtein distance is used to measure the difference between two sequences/vectors, and it calculates the minimum number of insertions, deletions, or substitutions required to transform one sequence into another.

As this sounds a bit similar to Hamming Distance, lets see the difference with example

Hamming vs Levenshtein Distance

Feature	Hamming Distance	Levenshtein Distance
Core Definition	Measures the number of positions where two strings of equal length differ.	Measures the minimum number of single-character edits needed to change one string into another.
Allowed Operations	Substitution only.	Insertion, Deletion, and Substitution.
String Length	Strings must be the exact same length. (Fails otherwise).	Strings can be of varying lengths.
Handling of "Shifts"	Highly sensitive. If you insert a letter at the start of a word, it misaligns the rest of the string, resulting in a massive error score.	Highly robust. It recognizes shifted characters and handles them via a single insertion or deletion.
Algorithmic Complexity	$O (n)$ - Very fast. It just scans the strings sequentially position-by-position.	$O (n \times m)$ - Computationally heavier. Requires building a dynamic programming matrix to find the optimal path of edits.
Best Use Cases	Error detection in fixed-length network packets, cryptography, binary data transmission, SNP analysis in genetics.	Spell checkers, fuzzy searching (e.g., matching "Jon" to "John"), deduplication of dirty database records, NLP.

Examples

Example	Levenshtein distance	Hamming Distance
A="GHOST" B="HOSTS"	Hamming Distance: 5 Hamming compares the words strictly by their position: 1. G vs H (Different) 2. H vs O (Different) 3. O vs S (Different) 4. S vs T (Different) 5. T vs S (Different)	Levenshtein Distance: 2 Levenshtein looks at the overall sequence and finds the most efficient way to edit one into the other: 1. Delete the "G" at the beginning (Leaves "HOST"). 2. Insert an "S" at the end (Creates "HOSTS"). It correctly recognizes that the core of the word ("HOST") is completely intact.
A="DATA" B="DATABASE"	Undefined (Error) Hamming distance requires both strings to be the exact same length.	Levenshtein Distance: 4 Levenshtein handles different lengths effortlessly by using insertions or deletions. 1. Insert "B" 2. Insert "A" 3. Insert "S" 4. Insert "E"
A="READ" B="ROAD"	Hamming Distance: 1	Levenshtein Distance: 1
Useage	When you are comparing fixed-length codes (like binary sequences or fixed-length genetic strings) where position and length matters absolutely	Natural human text, typos, and strings of varying lengths where characters might have been added or missed