Understanding the Confusion Matrix and Model Evaluation Metrics

The Confusion Matrix is the cornerstone of model evaluation for classification problems. It gives us a detailed breakdown of how our model's predictions compare to the actual outcomes.

The Confusion Matrix

cm-1.png

Image source: https://encord.com/

The confusion matrix is a table that summarizes the performance of a classification algorithm.

Example: COVID-19 Testing

Let's use a real-world example to make this concrete. Imagine we have a model that predicts whether a patient has COVID-19.

|                      | Predicted: COVID-19 (1) | Predicted: Healthy (0) |
|----------------------|-------------------------|------------------------|
| Actual: COVID-19 (1) | 126 (TP)                | 118 (FN)               |
| Actual: Healthy (0)  | 7 (FP)                  | 349 (TN)               |

Total Predictions: 126 + 118 + 7 + 349 = 600

Now, let's calculate the key performance metrics.

Core Evaluation Metrics

1. Accuracy

Question: Overall, how often is the classifier correct?
Explanation: Accuracy is the share of all predictions, positive and negative, that the model got right.
When to use it? When the classes are roughly balanced and both error types cost about the same.
When to be cautious: On imbalanced datasets, a model can achieve high accuracy simply by always predicting the majority class.

Formula
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Example
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (126 + 349) / (126 + 118 + 7 + 349) = 475 / 600 = 0.792 (79.2%)
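Accuracy's weakness on imbalanced data can be shown with a minimal sketch (all numbers hypothetical): a "model" that always predicts Healthy on a 95%-negative dataset.

```python
# Why accuracy misleads on imbalanced data (hypothetical numbers):
# a "model" that always predicts Healthy (0) on a 95%-negative dataset.
y_true = [1] * 5 + [0] * 95   # 5 sick patients, 95 healthy
y_pred = [0] * 100            # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / 5

print(accuracy)  # 0.95 -- looks impressive
print(recall)    # 0.0  -- but every sick patient is missed
```

This is why the metrics below (Precision, Recall, F1) are needed alongside accuracy.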

2. Precision

Question: Of all the cases the model predicted as positive, how many were actually positive?
Explanation: Precision measures how trustworthy the model's positive predictions are.
Formula
Precision = TP / (TP + FP)
When to use it?

When the cost of a False Positive is high. For example, in spam detection, you want to be very sure an email is spam before you send it to the spam folder, to avoid missing important emails.


Example

Of all the patients the model flagged as having COVID-19, how many actually had it?

Precision = TP / (TP + FP) = 126 / (126 + 7) = 126 / 133 = 0.947 (94.7%)

3. Recall (Sensitivity or True Positive Rate)

Question: Of all the actual positive cases, how many did the model correctly identify?
Explanation: Recall measures how completely the model finds the positive class.
Formula
Recall (Sensitivity) = TP / (TP + FN)
When to use it?

- Recall is crucial when the cost of a False Negative is high, i.e., when the priority is avoiding missed positives.
- This metric plays a crucial role in areas such as medical diagnosis and quality control. When identifying diseases, missing a sick patient (a False Negative) can have severe consequences. Recall ensures that the actual positive cases are not overlooked.
- In quality control, for detecting defects or anomalies, recall helps identify all faulty products, minimizing FNs. Recall emphasizes the ability to find actual positive instances.

Example
Recall (Sensitivity) = TP / (TP + FN) = 126 / (126 + 118) = 126 / 244 = 0.516 (51.6%)

4. Specificity (True Negative Rate)

Question: Of all the actual negative cases, how many did the model correctly identify?
Explanation: Specificity measures how well the model avoids false alarms on the negative class.
Formula
Specificity = TN / (TN + FP)
When to use it? When it is important not to flag negatives incorrectly, e.g., to avoid sending healthy patients for unnecessary treatment.
Example
Specificity = TN / (TN + FP) = 349 / (349 + 7) = 349 / 356 = 0.980 (98.0%)

5. F1-Score

Question: How well does the model balance Precision and Recall in a single number?
Explanation: The F1-Score is the harmonic mean of Precision and Recall; it is high only when both are high.
Formula
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
When to use it? When you need a single metric and both False Positives and False Negatives matter, especially on imbalanced classes.
Limitations: It ignores True Negatives entirely and weights Precision and Recall equally, which may not match the real costs of each error type.
Example
F1-Score = 2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.947 × 0.516) / (0.947 + 0.516) = 0.668
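The worked numbers from the COVID-19 example can be verified in a few lines:

```python
# Verifying the worked COVID-19 example from this section
# (TP=126, FN=118, FP=7, TN=349; 600 predictions total).
TP, FN, FP, TN = 126, 118, 7, 349

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
specificity = TN / (TN + FP)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:    {accuracy:.3f}")     # 0.792
print(f"Precision:   {precision:.3f}")    # 0.947
print(f"Recall:      {recall:.3f}")       # 0.516
print(f"Specificity: {specificity:.3f}")  # 0.980
print(f"F1-Score:    {f1:.3f}")           # 0.668
```

Note how the very high precision (0.947) and low recall (0.516) combine into a middling F1 of 0.668: the harmonic mean is pulled toward the weaker of the two.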

Advanced Evaluation Techniques

1. ROC Curve and AUC

★ ROC (Receiver Operating Characteristic curve)

The Receiver Operating Characteristic (ROC) curve is a graph showing the performance of a classification model at all classification thresholds.
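A quick sketch of what `sklearn.metrics.roc_curve` actually returns, using tiny hypothetical scores: one (FPR, TPR) point per distinct threshold in the scores.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy labels and scores (hypothetical), just to illustrate the outputs.
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc_value = roc_auc_score(y_true, y_scores)

print("FPR:", fpr)        # false positive rate at each threshold
print("TPR:", tpr)        # true positive rate at each threshold
print("AUC:", auc_value)  # 0.75: 3 of 4 positive/negative pairs are ranked correctly
```

The AUC here equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.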

★ AUC (Area under the ROC curve)

AUC measures the entire two-dimensional area underneath the ROC curve. It can be read as the probability that the model ranks a random positive example higher than a random negative one; 1.0 is a perfect ranking and 0.5 is no better than random.

roc-1.png|500

★ ROC Curve at Different Thresholds

| Threshold | TP | FN | FP | TN | Accuracy | Precision | Recall | When to choose it |
|-----------|----|----|----|----|----------|-----------|--------|-------------------|
| 0.35      | 46 | 2  | 17 | 34 | 0.81     | 0.73      | 0.96   | If false negatives (missed true positives) are highly costly, a lower threshold maximizes TPR (recall), even at the cost of more false positives. |
| 0.50      | 40 | 8  | 7  | 44 | 0.85     | 0.85      | 0.83   | If the costs of both error types are roughly equivalent, this point may offer the best balance between TPR and FPR. |
| 0.65      | 27 | 21 | 1  | 50 | 0.78     | 0.96      | 0.56   | If false positives (false alarms) are highly costly, a higher threshold keeps FPR low, even though TPR is reduced. |
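The threshold trade-off can be reproduced with a small sketch (labels and probabilities below are hypothetical):

```python
import numpy as np

# Sketch of how the decision threshold reshapes the confusion matrix
# (probabilities and labels are made up for illustration).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_proba = np.array([0.9, 0.7, 0.6, 0.4, 0.55, 0.3, 0.2, 0.1])

for threshold in (0.35, 0.50, 0.65):
    y_pred = (y_proba >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    tn = int(((y_pred == 0) & (y_true == 0)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn)
    print(f"threshold={threshold:.2f}: TP={tp} FP={fp} FN={fn} TN={tn}  "
          f"precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold converts False Negatives into True Positives (recall rises) but also True Negatives into False Positives (precision falls); raising it does the opposite.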

★ Limitations

ROC-AUC can look overly optimistic on heavily imbalanced datasets: when True Negatives dominate, the FPR stays low even for a model that produces many False Positives relative to the rare positive class.

2. Precision-Recall Curve and AUC

AUC and ROC work well for comparing models when the dataset is roughly balanced between classes. For imbalanced datasets, the Precision-Recall curve is usually more informative, since it focuses on the positive class and is unaffected by the number of True Negatives.

3. Cumulative Gain and Lift Curves

These curves are used to evaluate how well a model segments the population, e.g., what fraction of all positives is captured within the top-scored deciles compared with random selection.
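A toy sketch of the cumulative gain computation (all data hypothetical): rank the population by model score, then measure what share of all actual positives falls in the top-scored portion.

```python
import numpy as np

# Toy cumulative gain sketch with synthetic data.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)           # 0/1 labels, roughly 50% positive
y_score = y_true * 0.5 + rng.random(1000) * 0.5  # scores that track the label

order = np.argsort(-y_score)                     # highest scores first
gains = np.cumsum(y_true[order]) / y_true.sum()  # cumulative share of positives

for pct in (10, 20, 50):
    k = len(y_true) * pct // 100
    print(f"Top {pct:>2}% of the ranked population captures "
          f"{gains[k - 1]:.0%} of all positives")
```

A random model would capture pct% of the positives at each point; the ratio between the model's gain and that baseline is the lift at that depth.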

Python Example: Plotting Evaluation Metrics

Here is a Python code snippet that demonstrates how to generate and plot a Confusion Matrix, ROC Curve, and Precision-Recall Curve for a binary classification model using scikit-learn, seaborn, and matplotlib.

➛ Snippets Only

Plot 1: Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['Predicted Negative', 'Predicted Positive'],
            yticklabels=['Actual Negative', 'Actual Positive'])

Plot 2: ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
roc_auc = roc_auc_score(y_test, y_pred_proba)

axes[1].plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')

Plot 3: Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
pr_auc = auc(recall, precision)

axes[2].plot(recall, precision, color='blue', lw=2, label=f'PR curve (area = {pr_auc:.2f})')

ML_AI/images/eval-1.png

Plot 4: Evaluation Matrix
from sklearn.metrics import classification_report

# The variables `cm`, `y_test`, and `y_pred` are from the previous cell.

# --- Calculate Metrics from Confusion Matrix ---
# cm is structured as: [[TN, FP], [FN, TP]]
TN, FP, FN, TP = cm.ravel()

# 1. Accuracy: (TP + TN) / Total
# Overall, how often is the classifier correct?
accuracy = (TP + TN) / (TP + TN + FP + FN)

# 2. Recall (Sensitivity or True Positive Rate): TP / (TP + FN)
# Of all the actual positive cases, how many did the model correctly identify?
recall = TP / (TP + FN)

# 3. Specificity (True Negative Rate): TN / (TN + FP)
# Of all the actual negative cases, how many did the model correctly identify?
specificity = TN / (TN + FP)

# 4. Precision: TP / (TP + FP)
# Of all the cases the model predicted as positive, how many were actually positive?
precision = TP / (TP + FP)

# 5. F1-Score: 2 * (Precision * Recall) / (Precision + Recall)
# The harmonic mean of Precision and Recall, providing a single score that balances both.
f1_score = 2 * (precision * recall) / (precision + recall)

print("--- Calculated Manually from Confusion Matrix ---")
print(f"Accuracy:             {accuracy:.4f}")
print(f"Precision:            {precision:.4f}")
print(f"Recall (Sensitivity): {recall:.4f}")
print(f"Specificity:          {specificity:.4f}")
print(f"F1-Score:             {f1_score:.4f}")

print("\n" + "="*50 + "\n")

# --- Using sklearn's classification_report for a comprehensive summary ---
# This is the recommended and standard way to get these metrics.
print("--- Using sklearn.metrics.classification_report ---")
# 'support' is the number of actual occurrences of each class in y_test.
print(classification_report(y_test, y_pred, target_names=['Negative (0)', 'Positive (1)']))
--- Calculated Manually from Confusion Matrix ---
Accuracy:             0.7800
Precision:            0.7867
Recall (Sensitivity): 0.7763
Specificity:          0.7838
F1-Score:             0.7815

==================================================

--- Using sklearn.metrics.classification_report ---
              precision    recall  f1-score   support

Negative (0)       0.77      0.78      0.78       148
Positive (1)       0.79      0.78      0.78       152

    accuracy                           0.78       300
   macro avg       0.78      0.78      0.78       300
weighted avg       0.78      0.78      0.78       300

Scenarios: Choosing the Right Metric

1. Cancer Detection:
2. Email Spam Filtering
3. Credit Card Fraud Detection
4. Customer Churn Prediction
5. Loan Application Approval
6. Predicting Equipment Failure in a Factory
7. A/B Testing for a Website Redesign
8. Hiring: Screening Resumes
9. Content Recommendation (e.g., Netflix)
10. Self-Driving Car: Pedestrian Detection
11. Detecting Rare Disease
12. ROC Curve: Perfect Classifier
13. Threshold Adjustment Effects
14. Imbalanced Dataset Metrics
15. Virus Detection Priority
16. Harmonic Mean vs Arithmetic Mean
17. Email Spam Filter: Protecting Legitimate Emails
18. ROC-AUC Score of 0.5
19. Specificity-Sensitivity Trade-off
20. Imbalanced Dataset (95% Negative Class)
21. Zero False Positives
22. False Alarm Classification
23. Loan Default Prediction
24. Random Model ROC Curve
25. High Recall, Low Precision Interpretation
26. Minimizing False Negatives
27. Precision-Recall Curve vs ROC Curve
28. Error Rate Calculation
Error Rate = (FP + FN) / (TP + FN + TN + FP)
29. High ROC-AUC but Low Accuracy
30. ROC Curve X-Axis
31. Pre-Trial Detention Decision
Precision = TP / (TP + FP)
32. Type I Error and Type II Error

★ Which of the following best describes the relationship between Type I error and specificity?


★ Which of the following statements is true regarding Type I and Type II errors?


★ In a factory, a quality control system checks products for defects before they are shipped. The system is designed to minimize the shipment of defective products to customers. Which type of error is more critical to minimize in this scenario?