Set and Sequence Distances metrics
1. Jaccard Distance
Jaccard Similarity is a measure of similarity between two asymmetric binary vectors or between two sets.
It is denoted by
Formula:
It calculates the size of the intersection between two sets divided by the size of their union.
1. Jaccard Similarity
Range: The index ranges from 0 to 1. Range closer to 1 means more similarity in two sets of data.
2. Jaccard Distance
Example
Suppose we have two sets:
- Set A =
- Set B =
Application
Jaccard Distance is primarily used in clustering algorithms and anomaly detection systems that rely on distance matrices.
- Hierarchical Clustering: To group data points based on dissimilarity, placing points with fewer common elements farther apart.
- Density-Based Clustering (e.g., DBSCAN): As a valid metric satisfying the triangle inequality, making it suitable for defining neighborhood distances in sparse, high-dimensional binary data.
- Anomaly Detection: To identify outliers by finding data points with high dissimilarity (distance) from the majority of other points in the dataset.
Advantages
- It is simple to calculate.
- It is not affected by the size of dataset.
- It is a reliable option for determining similarity when the sets are of various sizes.
- It may be used to compare the similarity of any kind of data. (e.g time series data, photos, text, and images)
2. Hamming Distance
The Hamming Distance simply counts the number of positions at which the corresponding symbols are different.
One-Hot Encoding and Hamming Distance
One-Hot Encoding is a technique used to convert categorical variables into a binary format (0s and 1s) so machine learning algorithms can process them. Each unique category gets its own column, and for any given row, exactly one column is "hot" (set to 1) while all others are "cold" (set to 0).
The Scenario: Fruit Classification
Imagine we have a dataset with a Fruit feature that can be Apple, Banana, or Cherry. When we apply One-Hot Encoding, the data transforms like this:
Applying Hamming Distance
Because One-Hot Encoding forces every vector to have identical lengths, it is perfectly suited for Hamming Distance. Calculating "Hamming Distance" between Apple and Banana:
- Position 1: 1 vs 0 (Different
1) - Position 2: 0 vs 1 (Different
1) - Position 3: 0 vs 0 (Same
0)
- The Hamming Distance is 2.
Application
- Machine Learning
- Modes Clustering: A variation of the popular K-means algorithm designed specifically for categorical data. It uses Hamming distance to group similar data points together.
- Other applications
- Checksum and Parity Check Algorithms
- Recommendation Systems:
- Error detection,
- Hashing and Fuzzy Joins
- Bioinformatics
3. Levenshtein
Levenshtein distance is used to measure the difference between two sequences/vectors, and it calculates the minimum number of insertions, deletions, or substitutions required to transform one sequence into another.
As this sounds a bit similar to Hamming Distance, lets see the difference with example
Hamming vs Levenshtein Distance
| Feature | Hamming Distance | Levenshtein Distance |
|---|---|---|
| Core Definition | Measures the number of positions where two strings of equal length differ. | Measures the minimum number of single-character edits needed to change one string into another. |
| Allowed Operations | Substitution only. | Insertion, Deletion, and Substitution. |
| String Length | Strings must be the exact same length. (Fails otherwise). | Strings can be of varying lengths. |
| Handling of "Shifts" | Highly sensitive. If you insert a letter at the start of a word, it misaligns the rest of the string, resulting in a massive error score. | Highly robust. It recognizes shifted characters and handles them via a single insertion or deletion. |
| Algorithmic Complexity | ||
| Best Use Cases | Error detection in fixed-length network packets, cryptography, binary data transmission, SNP analysis in genetics. | Spell checkers, fuzzy searching (e.g., matching "Jon" to "John"), deduplication of dirty database records, NLP. |
Examples
| Example | Levenshtein distance | Hamming Distance |
|---|---|---|
| A="GHOST" B="HOSTS" |
Hamming Distance: 5 Hamming compares the words strictly by their position: 1. G vs H (Different) 2. H vs O (Different) 3. O vs S (Different) 4. S vs T (Different) 5. T vs S (Different) |
Levenshtein Distance: 2 Levenshtein looks at the overall sequence and finds the most efficient way to edit one into the other: 1. Delete the "G" at the beginning (Leaves "HOST"). 2. Insert an "S" at the end (Creates "HOSTS"). It correctly recognizes that the core of the word ("HOST") is completely intact. |
| A="DATA" B="DATABASE" |
Undefined (Error) Hamming distance requires both strings to be the exact same length. |
Levenshtein Distance: 4 Levenshtein handles different lengths effortlessly by using insertions or deletions. 1. Insert "B" 2. Insert "A" 3. Insert "S" 4. Insert "E" |
| A="READ" B="ROAD" |
Hamming Distance: 1 | Levenshtein Distance: 1 |
| Useage | When you are comparing fixed-length codes (like binary sequences or fixed-length genetic strings) where position and length matters absolutely | Natural human text, typos, and strings of varying lengths where characters might have been added or missed |