Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a dataset into a lower-dimensional space while retaining as much variance as possible.
- How is this goal achieved? Instead of selecting or eliminating features directly, PCA creates new features (Principal Components) that are linear combinations of the original features.
Key Pointers
- PCA works by finding the directions of maximum variance in the data and projecting the data onto these directions.
- The first principal component is the direction of maximum variance in the data, the second principal component is the direction of maximum variance orthogonal to the first principal component, and so on.
- Each principal component is a linear combination of the original features.
A linear combination simply means taking your existing columns of data, multiplying each one by a specific weight (a number), and adding them all together to create a brand new column.
That brand new column is your Principal Component.
When we say that PCA transforms a dataset into a lower-dimensional space, we mean that it reduces the number of features while preserving as much important information as possible: a small set of Principal Components stands in for the larger set of original features.
The Scenario: Student Test Scores
The scenario below illustrates the meaning of a Principal Component and a linear combination.
Imagine you have a dataset of high school students with three original features (variables):
- $X_1$: Math Score
- $X_2$: Physics Score
- $X_3$: Literature Score
The First Principal Component ($PC_1$)
When you run PCA, the algorithm looks for the largest pattern (the maximum variance) in the data. It realizes that Math and Physics move together, so it creates the First Principal Component by assigning heavy weights to those two subjects and a near-zero weight to Literature.
The mathematical linear combination for $PC_1$ has the form:
$$PC_1 = w_1 X_1 + w_2 X_2 + w_3 X_3$$
What this means:
- The numbers $w_1$, $w_2$, and $w_3$ are the weights.
- If a student gets a 90 in Math, a 95 in Physics, and a 60 in Literature, their single $PC_1$ score is $w_1(90) + w_2(95) + w_3(60)$.
- Because the weights for Math and Physics are the driving force behind this equation, $PC_1$ is no longer just a random number; it has naturally become an index representing a student's "Overall STEM Aptitude."
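As a rough illustration of how such a score is computed, the sketch below applies a set of weights to the student from the example. The weight values are hypothetical, chosen only to be heavy on Math and Physics and near zero on Literature; they do not come from a fitted PCA.

```python
# Hypothetical weights for PC1 (illustrative only, not from a real fit):
# heavy on Math and Physics, near zero on Literature.
weights = {"math": 0.70, "physics": 0.70, "literature": 0.10}

def pc1_score(scores, weights):
    """Linear combination: multiply each score by its weight and add them up."""
    return sum(weights[subject] * scores[subject] for subject in weights)

student = {"math": 90, "physics": 95, "literature": 60}
score = pc1_score(student, weights)
print(round(score, 2))  # 0.70*90 + 0.70*95 + 0.10*60 = 135.5
```

With these assumed weights, the Literature score moves the result very little, which is why the combined number behaves like a STEM index.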
The Second Principal Component ($PC_2$)
PCA then looks for the next biggest pattern that is entirely unrelated (orthogonal) to the first one. It creates a second linear combination of the same form, $PC_2 = u_1 X_1 + u_2 X_2 + u_3 X_3$, with its own set of weights.
What this means:
- Here, Literature ($X_3$) gets the massive weight, while Math and Physics actually get slightly negative weights.
- If a student is brilliant at Literature but struggles in Math, this equation will output a high number. $PC_2$ has organically become an index representing a student's "Humanities Aptitude."
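To see weights like these emerge from data, here is a minimal sketch using NumPy (assumed to be installed). The scores are synthetic and generated under an assumption made purely for illustration: Math and Physics share a common factor, while Literature varies independently. The exact weights, and their signs, will differ from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 200

# Synthetic data (an assumption for illustration): Math and Physics share a
# common "STEM" factor; Literature varies independently of it.
stem = rng.normal(70, 15, n)
math_score = stem + rng.normal(0, 3, n)
physics = stem + rng.normal(0, 3, n)
literature = rng.normal(70, 8, n)
X = np.column_stack([math_score, physics, literature])

# Center the columns, then eigen-decompose the covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]   # largest variance first
eigvecs = eigvecs[:, order]

pc1_weights = eigvecs[:, 0]  # weights on (Math, Physics, Literature)
pc2_weights = eigvecs[:, 1]
print("PC1 weights:", np.round(pc1_weights, 2))
print("PC2 weights:", np.round(pc2_weights, 2))
print("orthogonal:", np.isclose(pc1_weights @ pc2_weights, 0.0))
```

The sign of each eigenvector is arbitrary, so a component may come out with all weights flipped; the interpretation is unchanged.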
Step-by-Step Computation:
- Standardization: PCA is sensitive to the scale of the data, so subtract the mean of each column from every value so that the new mean is zero ($x' = x - \bar{x}$). When features are measured in different units or ranges, also divide each column by its standard deviation. Arrange the result in a matrix with one row per observation and one column per feature; there can be no missing data.
- Covariance Matrix: Calculate the covariance between every pair of features from the data matrix.
- Eigen-Decomposition: Compute the eigenvalues (variance magnitude) and eigenvectors (PC directions) of the covariance matrix.
- Sort & Select: Sort the eigenvectors in descending order of their corresponding eigenvalues. Select the top $k$ eigenvectors, where $k$ is the desired number of principal components. (Do this only if you need to reduce dimensionality, as it discards information from the data.)
- Project: Project the data onto the $k$ selected eigenvectors to obtain the reduced-dimensional representation.
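The steps above can be sketched in a few lines of NumPy (assumed to be installed). The function name `pca` and the tiny input matrix are made up for illustration; this is one straightforward way to follow the listed steps, not a replacement for a library implementation.

```python
import numpy as np

def pca(X, k):
    """Project X (rows = observations, columns = features) onto its top-k PCs."""
    X = np.asarray(X, dtype=float)
    # 1. Center each column (also divide by X.std(axis=0, ddof=1) to standardize).
    Xc = X - X.mean(axis=0)
    # 2. Covariance matrix of the features (np.cov divides by n - 1 by default).
    cov = np.cov(Xc, rowvar=False)
    # 3. Eigen-decomposition; eigh is meant for symmetric matrices and
    #    returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Sort descending by eigenvalue and keep the top k eigenvectors.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :k]
    # 5. Project the centered data onto the selected directions.
    scores = Xc @ components
    return scores, components, eigvals

# Tiny illustrative input (made-up numbers): 4 observations, 2 features.
X = [[2.0, 1.0], [4.0, 3.0], [6.0, 4.0], [8.0, 8.0]]
scores, components, eigvals = pca(X, k=1)
print(scores.shape)                # (4, 1)
print(eigvals[0] / eigvals.sum())  # share of variance kept by PC1
```

The sample variance of the returned scores equals the corresponding eigenvalue, which is why eigenvalues can be read as "variance captured".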
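How to Choose $k$
- Scree Plot: Plot the eigenvalues in descending order and look for the "elbow" where the variance gained from each additional component drops off.
- Cumulative Explained Variance: Choose the smallest $k$ such that the desired share of the total variance is retained.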
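The cumulative explained variance rule can be sketched as a small helper. The function name `choose_k`, the eigenvalues, and the thresholds below are all hypothetical and only show the mechanics; the right threshold depends on the application.

```python
def choose_k(eigenvalues, threshold):
    """Smallest k whose top-k eigenvalues reach `threshold` of total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ordered)

# Hypothetical eigenvalues and thresholds, for illustration only.
eigenvalues = [4.0, 3.0, 2.0, 1.0]
print(choose_k(eigenvalues, 0.70))  # 4 + 3 = 7 of 10 -> 2
print(choose_k(eigenvalues, 0.95))  # needs all four -> 4
```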
The Scenario: Wine quality
Imagine we have two features for 3 bottles of wine:
- Alcohol Content ($X_1$)
- Color Intensity ($X_2$)
| Bottle | X1 (Alcohol) | X2 (Color) |
|---|---|---|
| A | 10 | 2 |
| B | 20 | 8 |
| C | 30 | 5 |
Step 1: Center the Data (Mean Centering)
- Mean of $X_1$: $(10 + 20 + 30) / 3 = 20$
- Mean of $X_2$: $(2 + 8 + 5) / 3 = 5$
- Centered Data ($X_{\text{centered}}$):
| Bottle | X1 (Alcohol) | X2 (Color) |
|---|---|---|
| A | 10-20=-10 | 2-5=-3 |
| B | 20-20=0 | 8-5=3 |
| C | 30-20=10 | 5-5=0 |
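The centering step can be checked with a few lines of plain Python. This is only a sketch of the arithmetic in the table; the variable names are illustrative.

```python
# Wine data from the table: (alcohol, color) per bottle.
data = {"A": (10, 2), "B": (20, 8), "C": (30, 5)}

n = len(data)
mean_x1 = sum(x1 for x1, _ in data.values()) / n
mean_x2 = sum(x2 for _, x2 in data.values()) / n

# Subtract each column's mean so both columns average to zero.
centered = {b: (x1 - mean_x1, x2 - mean_x2) for b, (x1, x2) in data.items()}
print(mean_x1, mean_x2)  # 20.0 5.0
print(centered)
```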
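Step 2: Calculate the Covariance Matrix
We want to see how $X_1$ and $X_2$ vary together. Using the sample convention (dividing by $n - 1 = 2$):
- Variance of $X_1$: $\frac{(-10)^2 + 0^2 + 10^2}{2} = 100$
- Variance of $X_2$: $\frac{(-3)^2 + 3^2 + 0^2}{2} = 9$
- Covariance ($X_1, X_2$): $\frac{(-10)(-3) + (0)(3) + (10)(0)}{2} = 15$
- The Covariance Matrix ($\Sigma$):
$$\Sigma = \begin{pmatrix} 100 & 15 \\ 15 & 9 \end{pmatrix}$$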
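A small sketch of the covariance computation on the centered wine data follows. It assumes the sample convention (dividing by n - 1); dividing by n instead would rescale every entry by the same factor without changing the eigenvector directions.

```python
centered = [(-10, -3), (0, 3), (10, 0)]  # centered (alcohol, color) rows

def sample_cov(xs, ys):
    """Sample covariance of already-centered values (divides by n - 1)."""
    return sum(x * y for x, y in zip(xs, ys)) / (len(xs) - 1)

x1 = [row[0] for row in centered]
x2 = [row[1] for row in centered]
cov_matrix = [
    [sample_cov(x1, x1), sample_cov(x1, x2)],
    [sample_cov(x2, x1), sample_cov(x2, x2)],
]
print(cov_matrix)  # [[100.0, 15.0], [15.0, 9.0]]
```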
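Step 3.a: Calculate Eigenvalues ($\lambda$)
We solve the characteristic equation $\det(\Sigma - \lambda I) = 0$ for the covariance matrix $\Sigma = \begin{pmatrix} 100 & 15 \\ 15 & 9 \end{pmatrix}$:
$$(100 - \lambda)(9 - \lambda) - 15^2 = 0 \;\Longrightarrow\; \lambda^2 - 109\lambda + 675 = 0$$
Using the quadratic formula, we find:
- $\lambda_1 \approx 102.4$ (captures about 94% of the variance)
- $\lambda_2 \approx 6.6$ (captures about 6% of the variance)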
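For a 2x2 symmetric matrix the characteristic equation is a plain quadratic, so the eigenvalues can be checked with the standard library alone. The matrix entries below assume the sample (n - 1) covariance of the wine data.

```python
import math

# Covariance matrix entries [[a, b], [b, d]], assuming the n - 1 convention.
a, b, d = 100.0, 15.0, 9.0

# det([[a - lam, b], [b, d - lam]]) = 0
#   ->  lam^2 - (a + d)*lam + (a*d - b^2) = 0
trace = a + d
det = a * d - b * b
disc = math.sqrt(trace**2 - 4 * det)
lam1 = (trace + disc) / 2
lam2 = (trace - disc) / 2

print(round(lam1, 2), round(lam2, 2))  # about 102.41 and 6.59
print(round(lam1 / (lam1 + lam2), 3))  # share of variance on PC1, about 0.94
```

The two eigenvalues add up to the trace (100 + 9 = 109), the total variance, so each eigenvalue divided by 109 is that component's share.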
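Step 3.b: Calculate Eigenvectors ($v$)
We plug $\lambda_1 \approx 102.4$ back into $(\Sigma - \lambda_1 I)\,v = 0$ and solve for $v$. The first row gives $(100 - 102.4)\,v_1 + 15\,v_2 = 0$, so $v_2 \approx 0.16\,v_1$.
After normalizing (so the vector length is 1), our Eigenvector 1 is approximately:
$$v \approx \begin{pmatrix} 0.987 \\ 0.158 \end{pmatrix}$$
Interpretation: To make PC1, we take about 0.987 times the centered Alcohol value plus about 0.158 times the centered Color value, so PC1 is dominated by Alcohol.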
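The eigenvector for the largest eigenvalue can be recovered the same way by hand. This sketch assumes the sample covariance entries 100 and 15 and a rounded eigenvalue of about 102.41, so the last digit of the result may differ slightly from values obtained with other rounding.

```python
import math

a, b = 100.0, 15.0  # first row of the covariance matrix (n - 1 convention)
lam1 = 102.41       # largest eigenvalue, rounded

# First row of (Sigma - lam1*I) v = 0:  (a - lam1)*v1 + b*v2 = 0
# Pick v1 = 1, solve for v2, then scale the vector to unit length.
v1, v2 = 1.0, (lam1 - a) / b
norm = math.hypot(v1, v2)
v = (v1 / norm, v2 / norm)
print(round(v[0], 3), round(v[1], 3))
```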
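Step 4: Project the Data onto the New PC
Now we transform our centered data points into their new 1D "PC score" using the dot product with the eigenvector $v \approx (0.987, 0.158)$:
- Bottle A: $(-10)(0.987) + (-3)(0.158) \approx -10.34$
- Bottle B: $(0)(0.987) + (3)(0.158) \approx 0.47$
- Bottle C: $(10)(0.987) + (0)(0.158) = 9.87$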
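The projection is just a dot product per bottle. The sketch below assumes a rounded unit eigenvector of roughly (0.987, 0.158) for PC1, so the scores are accurate to about two decimals.

```python
centered = {"A": (-10, -3), "B": (0, 3), "C": (10, 0)}
v = (0.987, 0.158)  # rounded first eigenvector: (alcohol weight, color weight)

# PC1 score = dot product of each centered row with the eigenvector.
pc1 = {b: x1 * v[0] + x2 * v[1] for b, (x1, x2) in centered.items()}
for bottle, score in pc1.items():
    print(bottle, round(score, 2))
```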
Final Result
We have successfully reduced our 2-column dataset into a single column ($PC_1$):
| Bottle | Original Features (2D) | PCA Feature (1D) |
|---|---|---|
| A | (10, 2) | -10.34 |
| B | (20, 8) | 0.47 |
| C | (30, 5) | 9.87 |
Interpretation: These three numbers now represent the "essence" of the wine. You can now use this single column in a machine learning model, knowing it contains about 94% of the variance (the "information") that used to require two columns.
Advantages:
- ✅ Efficient for large datasets
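- ✅ Removes multicollinearity (the components are uncorrelated) and can reduce noise when low-variance components are dropped
- ✅ Can speed up downstream models and sometimes improve their performance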
Limitations:
- ❌ Linearity: Assumes linear relationships between features.
- ❌ Interpretability: New components are linear combinations; their "meaning" is often lost.
- ❌ Scale Sensitive: Requires standardization.
- ❌ Outliers: Variance-driven, so outliers can skew components.
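The scale-sensitivity limitation can be illustrated with the wine data from the scenario above. This is a minimal sketch, assuming sample (n - 1) statistics: once both features are standardized, the covariance matrix becomes the correlation matrix, and PC1 no longer leans almost entirely on Alcohol.

```python
import math

data = [(10, 2), (20, 8), (30, 5)]  # (alcohol, color) from the wine scenario
n = len(data)

def standardize(xs):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))
    return [(x - mean) / sd for x in xs]

z1 = standardize([row[0] for row in data])
z2 = standardize([row[1] for row in data])

# After standardizing, each variance is 1 and the covariance is the correlation.
r = sum(p * q for p, q in zip(z1, z2)) / (n - 1)

# For the 2x2 matrix [[1, r], [r, 1]] the eigenvalues are 1 + r and 1 - r,
# with eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2) when r > 0.
lam1, lam2 = 1 + r, 1 - r
print(round(r, 3))                     # 0.5
print(round(lam1 / (lam1 + lam2), 3))  # 0.75
```

On standardized data PC1 weights Alcohol and Color equally and holds 75% of the variance, compared with roughly 94% (mostly Alcohol) on the merely centered data. Neither result is wrong; they answer different questions, which is why the scaling choice should be made deliberately.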