Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a dataset into a lower-dimensional space while retaining as much variance as possible.

How is this goal achieved? Instead of selecting or eliminating features directly, PCA creates new features (Principal Components) that are linear combinations of the original features.

Key Pointers

PCA works by finding the directions of maximum variance in the data and projecting the data onto these directions.
The first principal component is the direction of maximum variance in the data, the second principal component is the direction of maximum variance orthogonal to the first principal component, and so on.
Each principal component is a linear combination of the original features.

What is a linear combination?

A linear combination simply means taking your existing columns of data, multiplying each one by a specific weight (a number), and adding them all together to create a brand new column.
That brand new column is your Principal Component.
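
To make the idea concrete, here is a minimal sketch of a linear combination in plain Python. The student scores and the weights are made-up illustrative values (they are not fitted by PCA), and `linear_combination` is a hypothetical helper name.

```python
# Hypothetical data: three feature columns for four students.
math_scores = [90, 70, 50, 80]
physics_scores = [95, 65, 55, 85]
literature_scores = [60, 80, 90, 70]

# Illustrative weights, one per column (assumed, not computed by PCA).
weights = [0.70, 0.70, 0.05]


def linear_combination(columns, weights):
    """Return a new column: the weighted sum of the input columns, row by row."""
    if len(columns) != len(weights):
        raise ValueError("need exactly one weight per column")
    return [
        sum(w * value for w, value in zip(weights, row))
        for row in zip(*columns)  # zip(*columns) walks the data one row at a time
    ]


new_column = linear_combination(
    [math_scores, physics_scores, literature_scores], weights
)
print(new_column)  # one number per student
```

Three columns go in and one column comes out. PCA's job is to pick the weights so that this new column varies as much as possible.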

What Does "Transforms a Dataset into a Lower-Dimensional Space" Mean?

When we say that PCA transforms a dataset into a lower-dimensional space, it means that we reduce the number of features while preserving as much important information as possible. Instead of selecting or eliminating individual features, PCA creates new features (Principal Components) that are combinations of the original features.

The Scenario: Student Test Scores

The scenario below illustrates what a Principal Component and a linear combination mean in practice.

Imagine you have a dataset of high school students with three original features (variables):

$X_{1}$ : Math Score
$X_{2}$ : Physics Score
$X_{3}$ : Literature Score

The First Principal Component ( $P C_{1}$ )

When you run PCA, the algorithm looks for the largest pattern (the maximum variance) in the data. It realizes that Math and Physics move together, so it creates the First Principal Component by assigning heavy weights to those two subjects and a near-zero weight to Literature.

The mathematical linear combination for $P C_{1}$ might look like this:

$$PC_{1} = (0.70 \times X_{1}) + (0.70 \times X_{2}) + (0.05 \times X_{3})$$

What this means:

The numbers $0.70$, $0.70$, and $0.05$ are the weights.
If a student gets a 90 in Math, a 95 in Physics, and a 60 in Literature, their single $PC_{1}$ score is: $(0.70 \times 90) + (0.70 \times 95) + (0.05 \times 60) = 132.5$.
Because the weights for Math and Physics dominate this equation, $PC_{1}$ is not an arbitrary number: it behaves like an index of a student's "Overall STEM Aptitude."

The Second Principal Component ( $P C_{2}$ )

PCA then looks for the next-largest pattern whose direction is orthogonal to the first one, so that the two components are uncorrelated. It creates a second linear combination, which might look like this:

$$PC_{2} = (-0.10 \times X_{1}) + (-0.10 \times X_{2}) + (0.95 \times X_{3})$$

What this means:

Here, Literature ( $X_{3}$ ) gets the massive $0.95$ weight, while Math and Physics actually get slightly negative weights.
If a student is brilliant at Literature but struggles in Math, this equation will output a high number.
$P C_{2}$ has organically become an index representing a student's "Humanities Aptitude."
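
As a quick check of both equations, the sketch below scores two students on $PC_{1}$ and $PC_{2}$. The first student is the one from the example; the second ("humanities_student") is a hypothetical addition for contrast. The weights are the illustrative ones above, so they are only approximately orthogonal; weights produced by a real PCA would have a dot product of exactly zero.

```python
# Weights copied from the two illustrative equations above.
pc1_weights = (0.70, 0.70, 0.05)
pc2_weights = (-0.10, -0.10, 0.95)


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


# Scores are ordered as (Math, Physics, Literature).
stem_student = (90, 95, 60)        # the student from the text
humanities_student = (55, 50, 95)  # hypothetical student, for contrast

pc1_stem = dot(pc1_weights, stem_student)
pc2_stem = dot(pc2_weights, stem_student)
pc1_hum = dot(pc1_weights, humanities_student)
pc2_hum = dot(pc2_weights, humanities_student)

# How far the rounded weight vectors are from being orthogonal.
weight_overlap = dot(pc1_weights, pc2_weights)

print(pc1_stem, pc2_stem)
print(pc1_hum, pc2_hum)
print(weight_overlap)
```

The STEM-leaning student scores higher on $PC_{1}$ and the Literature-leaning student scores higher on $PC_{2}$, which is the behavior the two "aptitude" labels describe.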

Step-by-Step Computation:

Standardization:
PCA is sensitive to the scale of the data, so we subtract each column's mean from every value and then divide by the column's standard deviation, so that every column has mean zero and unit variance ( $μ = 0, σ = 1$ ).
Set up the standardized data in a matrix in which each row is an observation and each column is a feature. There can be no missing data.
Covariance Matrix:
Compute the covariance between every pair of features from the standardized data matrix.
Eigen-Decomposition:
Compute the eigenvalues (variance magnitude) and eigenvectors (PC directions) of the covariance matrix.
Sort & Select:
Sort the eigenvectors in descending order of their corresponding eigenvalues.
Select the top k eigenvectors that correspond to the largest eigenvalues, where k is the desired number of principal components. (Do this step only if you need to reduce dimensionality, as it will eliminate information from the data.)
Project:
Project the data onto the k selected eigenvectors to obtain the reduced-dimensional representation.
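
The five steps can be sketched end to end in plain Python. This is a teaching sketch under stated assumptions, not a production implementation: the eigen-decomposition uses power iteration with deflation as a stand-in for a library eigen-solver, all function names (`standardize`, `covariance_matrix`, `top_eigenpairs`, `pca`) are hypothetical, and the demo data is made up. It also assumes no column is constant (otherwise the standard deviation is zero).

```python
import math
import random


def standardize(rows):
    """Step 1: center each column and scale it to unit sample standard deviation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(d)
    ]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def covariance_matrix(centered_rows):
    """Step 2: sample covariance (divides by n - 1) of already-centered rows."""
    n, d = len(centered_rows), len(centered_rows[0])
    return [
        [sum(r[i] * r[j] for r in centered_rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def mat_vec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def top_eigenpairs(matrix, k, iterations=1000, seed=0):
    """Steps 3-4: the k largest eigenpairs of a symmetric PSD matrix.

    Power iteration finds the dominant eigenvector; deflation then removes it
    so the next pass finds the next one. Pairs come out in descending order.
    """
    rng = random.Random(seed)
    d = len(matrix)
    work = [list(row) for row in matrix]
    pairs = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]  # random start vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = mat_vec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to find (remaining variance is ~0)
                break
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, mat_vec(work, v)))
        pairs.append((eigenvalue, v))
        for i in range(d):  # deflate: subtract the component just found
            for j in range(d):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return pairs


def pca(rows, k):
    """Run all five steps; return (scores, components, explained_variances)."""
    z = standardize(rows)
    pairs = top_eigenpairs(covariance_matrix(z), k)
    components = [vec for _, vec in pairs]
    # Step 5: project every standardized row onto each selected eigenvector.
    scores = [
        [sum(a * b for a, b in zip(row, vec)) for vec in components] for row in z
    ]
    return scores, components, [val for val, _ in pairs]


# Hypothetical data: 4 observations, 3 features, reduced to k = 2 components.
data = [[2.0, 1.0, 4.0], [4.0, 3.0, 1.0], [6.0, 2.0, 3.0], [8.0, 6.0, 2.0]]
scores, components, explained = pca(data, k=2)
print(scores)
print(explained)
```

The sign of each eigenvector is arbitrary, so a component and its scores may come out flipped relative to another implementation; the variance they capture is the same either way.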

How to Choose $k$ ?

Scree Plot: Look for the "elbow" where variance gain drops.
Cumulative Explained Variance: Choose $k$ such that roughly $90$ to $95\%$ of the total variance is retained.
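
The cumulative-variance rule is easy to automate. The sketch below uses a hypothetical helper, `choose_k`, and made-up eigenvalues for a five-feature dataset; the threshold is whatever fraction of variance you decide to keep.

```python
def choose_k(eigenvalues, threshold=0.95):
    """Smallest k whose top-k eigenvalues reach `threshold` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ordered)


# Hypothetical eigenvalues; cumulative shares are 50%, 80%, 92%, 97%, 100%.
eigenvalues = [5.0, 3.0, 1.2, 0.5, 0.3]
k_90 = choose_k(eigenvalues, threshold=0.90)
k_95 = choose_k(eigenvalues, threshold=0.95)
print(k_90, k_95)
```

With these numbers, three components are enough for a 90% target and four are needed for 95%.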

The Scenario: Wine quality

Imagine we have two features for 3 bottles of wine:

Alcohol Content ( $X_{1}$ )
Color Intensity ( $X_{2}$ )

Bottle	X1 (Alcohol)	X2 (Color)
A	10	2
B	20	8
C	30	5

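Step 1: Center the Data (Mean Centering)

To keep the arithmetic simple, this example only centers the data. It does not also divide by the standard deviation, so Alcohol's larger variance will dominate the result.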

Mean $X_{1}$ : $(10 + 20 + 30) / 3 = 20$
Mean $X_{2}$ : $(2 + 8 + 5) / 3 = 5$
Centered Data ( $X_{\text{centered}}$ ):

Bottle	X1 (Alcohol)	X2 (Color)
A	10-20=-10	2-5=-3
B	20-20=0	8-5=3
C	30-20=10	5-5=0
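
The same centering step in plain Python, as a small sketch using the three bottles from the table (the variable names are illustrative):

```python
# Wine data from the table: bottle -> (Alcohol, Color).
bottles = {"A": (10, 2), "B": (20, 8), "C": (30, 5)}

rows = list(bottles.values())
n = len(rows)

# Column means: one mean per feature.
means = [sum(column) / n for column in zip(*rows)]

# Subtract each column's mean from every value in that column.
centered = [[x - m for x, m in zip(row, means)] for row in rows]

print(means)
print(centered)
```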

Step 2: Calculate the Covariance Matrix

We want to see how $X_{1}$ and $X_{2}$ vary together. We use the formula:

$$\operatorname{Cov}(X, Y) = \frac{\sum_{i} (x_{i} - \bar{x}) (y_{i} - \bar{y})}{n - 1}$$

Variance $X_{1}$ : $[(- 10)^{2} + 0^{2} + 10^{2}] / 2 = 100$
Variance $X_{2}$ : $[(- 3)^{2} + 3^{2} + 0^{2}] / 2 = 9$
Covariance ( $X_{1}, X_{2}$ ): $[(- 10 \times - 3) + (0 \times 3) + (10 \times 0)] / 2 = 15$
The Covariance Matrix ( $Σ$ ):

$$Σ = \begin{bmatrix} 100 & 15 \\ 15 & 9 \end{bmatrix}$$
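
A short sketch that reproduces this matrix from the centered data, using the same $n - 1$ formula. `sample_cov` is a hypothetical helper name, and it assumes its inputs are already centered.

```python
# Centered wine data: one (Alcohol, Color) pair per bottle.
centered = [(-10, -3), (0, 3), (10, 0)]


def sample_cov(xs, ys):
    """Sample covariance of two already-centered columns (divides by n - 1)."""
    return sum(x * y for x, y in zip(xs, ys)) / (len(xs) - 1)


columns = list(zip(*centered))  # [(alcohol values), (color values)]

# Entry (i, j) is the covariance of feature i with feature j;
# the diagonal entries are the variances.
cov_matrix = [[sample_cov(ci, cj) for cj in columns] for ci in columns]
print(cov_matrix)
```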

Step 3.a: Calculate Eigenvalues ( $λ$ )

We solve the characteristic equation: $\det(Σ - λI) = 0$.

$$\det \begin{bmatrix} 100 - λ & 15 \\ 15 & 9 - λ \end{bmatrix} = 0$$

$$(100 - λ)(9 - λ) - (15 \times 15) = 0$$

$$λ^{2} - 109λ + 675 = 0$$

Using the quadratic formula, we find:

$λ_{1} \approx 102.4$ (captures about $94\%$ of the variance, since $102.4 / 109 \approx 0.94$)
$λ_{2} \approx 6.6$ (captures about $6\%$ of the variance, since $6.6 / 109 \approx 0.06$)
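
The quadratic-formula step can be checked numerically. For a $2 \times 2$ matrix the characteristic equation is $λ^{2} - (\text{trace})λ + \det = 0$, so the sketch below only needs the trace and determinant of $Σ$. Variable names are illustrative.

```python
import math

# Entries of the symmetric covariance matrix [[a, b], [b, c]].
a, b, c = 100.0, 15.0, 9.0

trace = a + c              # coefficient of -λ in the characteristic equation
determinant = a * c - b * b  # constant term

# Quadratic formula applied to λ^2 - trace·λ + determinant = 0.
root = math.sqrt(trace ** 2 - 4 * determinant)
lambda_1 = (trace + root) / 2
lambda_2 = (trace - root) / 2

# The eigenvalues sum to the total variance (the trace),
# so each eigenvalue's share is eigenvalue / trace.
share_1 = lambda_1 / trace
share_2 = lambda_2 / trace

print(round(lambda_1, 1), round(lambda_2, 1))
print(round(share_1, 2), round(share_2, 2))
```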

Step 3.b: Calculate Eigenvectors ( $v$ )

We plug $λ_{1} = 102.4$ back into $(Σ - λ I) v = 0$ to find the direction.

$$(100 - 102.4)x + 15y = 0 \;\to\; -2.4x + 15y = 0 \;\to\; y = 0.16x$$

After normalizing (so the vector length is 1), our Eigenvector 1 is approximately:

$$v_{1} = [0.987, 0.158]$$

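Interpretation: $PC_{1}$ weights centered Alcohol by $0.987$ and centered Color by $0.158$. These are the entries of a unit-length vector, not percentages; Alcohol dominates because its variance is much larger than Color's.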
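
Here is a small sketch of the eigenvector step. It solves the first row of $(Σ - λ_{1} I) v = 0$, normalizes the result, and then checks that $Σ v_{1} = λ_{1} v_{1}$. It uses the unrounded $λ_{1}$, so the last digit can differ slightly from the hand calculation, which plugs in the rounded value $102.4$.

```python
import math

# Symmetric covariance matrix [[a, b], [b, c]] from Step 2.
a, b, c = 100.0, 15.0, 9.0

# Largest eigenvalue from the characteristic equation (Step 3.a).
trace, determinant = a + c, a * c - b * b
lambda_1 = (trace + math.sqrt(trace ** 2 - 4 * determinant)) / 2

# First row of (Σ - λ1·I) v = 0 reads (a - λ1)·x + b·y = 0,
# which is satisfied by (x, y) = (b, λ1 - a).
direction = (b, lambda_1 - a)

# Normalize so the vector has length 1.
length = math.hypot(direction[0], direction[1])
v1 = (direction[0] / length, direction[1] / length)

# Sanity check: Σ·v1 should equal λ1·v1 up to floating-point error.
sigma_v1 = (a * v1[0] + b * v1[1], b * v1[0] + c * v1[1])
residual = max(abs(sigma_v1[i] - lambda_1 * v1[i]) for i in range(2))

print(v1)
print(residual)
```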

Step 4: Project the Data onto the New PC

Now we transform our original centered points into their new 1D "PC score" using the dot product: $PC_{1} = X_{\text{centered}} \cdot v_{1}$.

Bottle A: $(- 10 \times 0.987) + (- 3 \times 0.158) = - 10.34$
Bottle B: $(0 \times 0.987) + (3 \times 0.158) = 0.47$
Bottle C: $(10 \times 0.987) + (0 \times 0.158) = 9.87$
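
The projection is one dot product per bottle. A minimal sketch, using the centered data and the rounded eigenvector from the steps above:

```python
# Centered wine data: bottle -> (Alcohol, Color).
centered = {"A": (-10, -3), "B": (0, 3), "C": (10, 0)}

# Rounded first eigenvector from Step 3.b.
v1 = (0.987, 0.158)

# Each PC1 score is the dot product of a centered row with v1.
pc1_scores = {
    bottle: row[0] * v1[0] + row[1] * v1[1]
    for bottle, row in centered.items()
}

for bottle, score in pc1_scores.items():
    print(bottle, round(score, 2))
```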

Final Result

We have successfully reduced our 2-column dataset into a single column ( $P C_{1}$ ):

Bottle	Original Features (2D)	PCA Feature (1D)
A	(10, 2)	-10.34
B	(20, 8)	0.47
C	(30, 5)	9.87

Interpretation: These three numbers summarize each wine along the direction of greatest variance. You can now use this single column in a machine learning model, knowing it retains about $94\%$ of the variance that used to be spread across two columns.
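
One way to see what the single column keeps and what it loses is to measure its variance and then map the scores back to two dimensions. The sketch below does both with the rounded values from this example; `approx` is an illustrative name for the reconstructed points.

```python
# Values taken from the worked example (rounded as in the text).
means = (20.0, 5.0)
v1 = (0.987, 0.158)
pc1_scores = {"A": -10.34, "B": 0.47, "C": 9.87}
originals = {"A": (10, 2), "B": (20, 8), "C": (30, 5)}

# Share of variance retained: variance of the PC1 column over total variance.
n = len(pc1_scores)
score_mean = sum(pc1_scores.values()) / n
pc1_variance = sum((s - score_mean) ** 2 for s in pc1_scores.values()) / (n - 1)
total_variance = 100.0 + 9.0  # Var(Alcohol) + Var(Color) from Step 2
retained = pc1_variance / total_variance

# Reconstruction: score × v1 gives a centered point; add the means back.
approx = {
    bottle: (means[0] + s * v1[0], means[1] + s * v1[1])
    for bottle, s in pc1_scores.items()
}

print(round(retained, 2))
for bottle in originals:
    print(bottle, originals[bottle], tuple(round(x, 2) for x in approx[bottle]))
```

Alcohol is recovered almost exactly, while most of the Color detail is gone. That is the small share of variance that $PC_{1}$ does not capture.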

Advantages:

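✅ Efficient: relies on standard linear algebra routines, so it scales to large datasets.
✅ Decorrelates features: the components are uncorrelated, which removes multicollinearity, and dropping low-variance components can filter out noise.
✅ Can improve model speed, and sometimes accuracy, by reducing the number of inputs.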

Limitations:

❌ Linearity: Assumes linear relationships between features.
❌ Interpretability: New components are linear combinations; their "meaning" is often lost.
❌ Scale Sensitive: Requires standardization.
❌ Outliers: Variance-driven, so outliers can skew components.
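
The scale-sensitivity point can be seen on the wine example itself. The sketch below compares the first PC direction from the covariance matrix (centered data only) with the direction from the correlation matrix (what you get after full standardization). `leading_direction` is a hypothetical helper that uses the closed form for a symmetric $2 \times 2$ matrix with a non-zero off-diagonal entry.

```python
import math


def leading_direction(a, b, c):
    """Unit eigenvector for the largest eigenvalue of [[a, b], [b, c]], b != 0."""
    trace, determinant = a + c, a * c - b * b
    largest = (trace + math.sqrt(trace ** 2 - 4 * determinant)) / 2
    x, y = b, largest - a  # solves (a - λ)·x + b·y = 0
    length = math.hypot(x, y)
    return (x / length, y / length)


# Centered only: covariance matrix [[100, 15], [15, 9]] from the wine example.
raw_direction = leading_direction(100.0, 15.0, 9.0)

# Fully standardized: both variances become 1 and the off-diagonal entry
# becomes the correlation, 15 / (10 × 3).
correlation = 15.0 / (math.sqrt(100.0) * math.sqrt(9.0))
scaled_direction = leading_direction(1.0, correlation, 1.0)

print(raw_direction)
print(scaled_direction)
```

Without scaling, Alcohol dominates the first component simply because its numbers are larger. After standardization both features contribute equally.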