I. What is Polynomial Transformation?

The Big Picture

Imagine you're trying to predict house prices using only square footage. You fit a simple linear model and get a straight line. But what if the relationship isn't straight? What if bigger houses don't just add value linearly—maybe value grows faster than linearly because luxury buyers pay premium prices for extra space?

This is where polynomial transformation comes in. It's a feature engineering technique that helps us model non-linear relationships by creating new features that are powers or cross-products of our original features.

💭 Simple analogy: Think of it as "bending" your straight line (linear regression) so it can fit a curved set of data points. Instead of forcing a straight ruler onto a curved surface, you're making the ruler flexible.

poly_reg-1.png|600

The Math Behind It

A polynomial transformation creates new features by raising existing numeric features to powers and (optionally) creating interaction terms.

Example 1: Single feature
If you have one feature x (say, square footage), a degree-2 transformation creates:

x → [1, x, x²]

So if square footage = 1000, you now have the features [1, 1000, 1,000,000].

Example 2: Two features
If you have two features, a and b (say, square footage and number of rooms), a degree-2 transformation creates:

[a, b] → [1, a, b, a², ab, b²]

That ab term is called an interaction term. It captures the idea that square footage and number of rooms might work together to affect price. A 5-bedroom house with 4000 sq ft is worth much more than a 5-bedroom with 1000 sq ft—the features interact!

II. When Should You Use Polynomial Transformation?

Here's the golden rule: Use polynomial features when you have a non-linear relationship but want to keep using a linear model.

1. It Depends on Your Model

Not all models benefit from polynomial features. Let's break this down:

✅ Models that LOVE polynomial features:
These are models that assume linearity in their inputs:

  • Linear Regression
  • Logistic Regression
  • Support Vector Machines with a linear kernel

Why? Because these models can only draw straight lines (or flat hyperplanes in higher dimensions). By giving them x² or x³, you're letting them "see" curves without changing the model itself.

❌ Models that DON'T need polynomial features:

  • Decision Trees and Random Forests
  • Gradient Boosting (e.g., XGBoost, LightGBM)
  • Neural Networks

Why? These models can already capture non-linearities directly. Trees split on different thresholds, and neural networks use activation functions. Adding polynomial features to these models is usually redundant and just makes training slower.

When to still use polynomial features despite having complex models:
You might still use them if you need a model that's interpretable. For example, saying "sales increase with the square of advertising spend" is much easier to explain to a business stakeholder than "the neural network found a pattern."

2. You're Seeing Signs of Underfitting

An underfit model shows high error on both training and test data. It's like trying to fit a straight line through data that clearly curves.

Look at this example—we're trying to approximate part of a cosine function:

ML_AI/_feature_engineering/images/underfitting.png|700

Notice how:

  • The straight line (degree 1) cannot follow the curve of the cosine, no matter how it's positioned.
  • The errors are systematic: the line is consistently too high in some regions and too low in others.

Red flags for underfitting:

  • High error on the training set (the model can't even fit the data it has seen).
  • Similarly high error on the validation/test set.
  • Residuals that show a clear pattern instead of random scatter.
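To make this concrete, here is a small sketch on synthetic data (the quadratic data-generating function is an assumption for illustration). A degree-1 model shows the underfitting signature—poor scores on both splits—while a degree-2 model fixes it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved data: y depends on x squared
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Degree 1 underfits: poor scores on BOTH splits (the red flag above)
linear = LinearRegression().fit(X_tr, y_tr)
# Degree 2 captures the curve
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_tr, y_tr)

print(f"degree 1: train R^2 = {linear.score(X_tr, y_tr):.2f}, test R^2 = {linear.score(X_te, y_te):.2f}")
print(f"degree 2: train R^2 = {quadratic.score(X_tr, y_tr):.2f}, test R^2 = {quadratic.score(X_te, y_te):.2f}")
```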

III. How to Select Features for Transformation?

You shouldn't blindly apply polynomial features to every variable. It adds complexity and can hurt performance. Here’s a practical guide to choosing the right features to transform.

1. Visual Inspection: Scatter Plots

This is the most direct method. For each numeric feature, create a scatter plot of that feature against your target variable.

!500
In this example, the LOESS line (in red) clearly shows a curve, suggesting a polynomial term would be beneficial.

2. Model Diagnostics: Residual Plots

If you've already built a linear model, its mistakes can tell you what it's missing. A residual is the error of a prediction (Actual Value - Predicted Value).

!500
A curved pattern in the residuals is a classic sign that you need to account for non-linearity.
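You can also check this numerically rather than visually. In the sketch below (synthetic data with a hidden quadratic term; the specific coefficients are made up for illustration), a straight-line fit leaves residuals that still correlate with x²:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a hidden quadratic term (coefficients are illustrative)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
y = 2 + 0.5 * x + 0.3 * x**2 + rng.normal(0, 1, 300)

model = LinearRegression().fit(x.reshape(-1, 1), y)
residuals = y - model.predict(x.reshape(-1, 1))

# Least squares makes residuals uncorrelated with x itself...
c_x = np.corrcoef(x, residuals)[0, 1]
# ...but the missed curvature shows up as correlation with x^2
c_x2 = np.corrcoef(x**2, residuals)[0, 1]
print(f"corr(x, residuals)   = {c_x:.3f}")
print(f"corr(x^2, residuals) = {c_x2:.3f}")
```

A clearly nonzero correlation between the residuals and x² is the numerical counterpart of the curved pattern in a residual plot.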

3. Domain Knowledge

Sometimes, you know a relationship should be non-linear based on real-world principles. Don't ignore this!

4. Interaction Terms

Don't forget that polynomial features can also model how two features work together. If you suspect the effect of one feature depends on the level of another, you should include an interaction term (x₁·x₂).

IV. What Power (Degree) Should You Use?

Choosing the degree is a balancing act. Too low, and you underfit. Too high, and you overfit. Here’s a guide to getting it right.

The Rule of Thumb: Keep it Simple

For the vast majority of real-world machine learning problems, you should stay between Degree 2 and Degree 3.

ML_AI/_feature_engineering/images/overfitting.png|500
The high-degree polynomial (green line) wiggles to catch every data point, making it a poor general model.

How to Find the Best Degree: A Practical Approach

You can't always "see" the best degree, especially with many features. Use a methodical, data-driven approach.

  1. Start with Degree 1 (Linear): Fit a simple linear model and establish a baseline performance metric (like R-squared for regression).
  2. Try Degree 2: Create degree-2 polynomial features and retrain your model.
    • Did the performance on your validation set improve significantly? If yes, this is a good sign.
  3. Cautiously Try Degree 3: If degree 2 gave a good boost, try degree 3.
    • If performance improves again, great.
    • If it improves only slightly, or if your validation performance gets worse, then stick with Degree 2. The added complexity isn't worth it.
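The steps above can be sketched with cross-validation on synthetic data (the quadratic data-generating function here is an assumption for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a genuinely quadratic relationship
rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(150, 1))
y = 1 + 2 * X.ravel() - 1.5 * X.ravel() ** 2 + rng.normal(0, 0.5, 150)

scores = {}
for degree in (1, 2, 3):
    pipe = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    scores[degree] = cross_val_score(pipe, X, y, cv=5, scoring="r2").mean()
    print(f"degree {degree}: mean CV R^2 = {scores[degree]:.3f}")
```

On data like this, degree 2 gives a large jump over degree 1, and degree 3 adds essentially nothing—so you would stop at 2.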

Use Regularization as a Safety Net

If you must use a higher degree (e.g., 3 or 4), you must pair it with a regularized model like Ridge or Lasso Regression.

  • How it helps: Regularization adds a penalty for large coefficients. It automatically "shrinks" the coefficients of the less useful polynomial terms towards zero, effectively performing automated feature selection and reducing the risk of overfitting.
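A sketch of the safety net on synthetic data: fit the same degree-10 pipeline with plain LinearRegression and with Ridge, and compare coefficient sizes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(7)
X = np.sort(rng.uniform(0, 1, size=(30, 1)), axis=0)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)

def fit_degree10(model):
    """Same degree-10 pipeline, swapping only the final estimator."""
    return make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=10, include_bias=False),
        model,
    ).fit(X_tr, y_tr)

plain = fit_degree10(LinearRegression())
ridge = fit_degree10(Ridge(alpha=1.0))

# Regularization shrinks the coefficient vector of the high-degree terms
coef_plain = plain.named_steps["linearregression"].coef_
coef_ridge = ridge.named_steps["ridge"].coef_
print("plain coefficient norm:", np.linalg.norm(coef_plain))
print("ridge coefficient norm:", np.linalg.norm(coef_ridge))
```

The ridge coefficients are always smaller in norm than the unregularized ones—that shrinkage is exactly what tames the wiggles of a high-degree fit.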

V. The Importance of Scaling

When you create polynomial features, especially with degrees higher than 1, the new features can have vastly different scales from each other. For example, if a feature x ranges from 1 to 100, x² will range from 1 to 10,000. This huge difference in scale can cause serious problems for many machine learning models.

Scaling your data is not just recommended; it's often required for the model to perform correctly.

Which Models Are Sensitive to Scale?

  • Regularized models (Ridge, Lasso, Elastic Net): the penalty treats all coefficients equally, so features on large scales are penalized unfairly.
  • Distance-based models (KNN, SVM, K-Means): large-scale features dominate the distance calculation.
  • Gradient-descent-based models (SGD-trained linear models, Neural Networks): very different scales make optimization slow and unstable.
  • Tree-based models (Decision Trees, Random Forests, Gradient Boosting) are NOT sensitive to scale.

The Correct Order of Operations: Scale THEN Transform

The best practice is to standardize your features before applying the polynomial transformation.

Workflow: StandardScaler → PolynomialFeatures → Model

Here’s why this is the preferred method:

  1. Prevents Numerical Instability: If you create polynomial features first from unscaled data, you can end up with extremely large numbers (e.g., 1000^3 = 1,000,000,000). This can cause numerical overflow errors during model training. Scaling first keeps all values in a controlled range (e.g., around -3 to 3 for StandardScaler).
  2. Reduces Multicollinearity: Centering the data by scaling (subtracting the mean) before creating polynomial terms helps reduce the correlation between a feature and its powers (e.g., between x and x²). This makes the model's coefficients more stable and interpretable.
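The multicollinearity point is easy to verify: center a strictly positive feature and the correlation between x and x² collapses (to exactly zero here, because the evenly spaced values become symmetric around zero).

```python
import numpy as np

x = np.linspace(1, 100, 200)   # a strictly positive feature
x_centered = x - x.mean()      # centering, as StandardScaler does

c_raw = np.corrcoef(x, x**2)[0, 1]
c_centered = np.corrcoef(x_centered, x_centered**2)[0, 1]
print(f"corr(x, x^2) before centering: {c_raw:.3f}")       # ~0.97: nearly collinear
print(f"corr(x, x^2) after centering:  {c_centered:.3f}")  # ~0
```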

While you can scale after creating polynomial features, it's less ideal because you miss out on the benefits above.

Simple Rule

If your model uses regularization, distances, or gradients, and you are creating polynomial features, you must scale your data. The best way to do this is to scale first, then apply the polynomial transformation.

VI. Risks and Disadvantages

While powerful, polynomial features are not a free lunch. You must be aware of the risks involved.

1. Overfitting

This is the biggest danger. As you increase the polynomial degree, the model becomes more flexible and can create very complex curves. A high-degree polynomial will fit your training data almost perfectly, but it will have learned the noise in your data, not the underlying signal. When you show it new data, its predictions will be wild and unreliable.
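A quick sketch of the danger on synthetic data: a degree-12 polynomial has almost as many parameters as training points, so it nearly memorizes the training set while generalizing worse.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

# Degree 12 on ~22 training points: far too flexible for this dataset
high = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=12, include_bias=False),
    LinearRegression(),
).fit(X_tr, y_tr)

train_r2 = high.score(X_tr, y_tr)
test_r2 = high.score(X_te, y_te)
print(f"train R^2: {train_r2:.2f}")  # very high: the model is fitting the noise
print(f"test R^2:  {test_r2:.2f}")   # noticeably worse on unseen data
```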

Mitigation:

  • Keep the degree low (2 or 3).
  • Use cross-validation to compare degrees on held-out data.
  • Pair higher degrees with a regularized model (Ridge or Lasso).

2. Feature Explosion

The number of new features grows combinatorially with the degree and the number of original features.

This leads to a very high-dimensional dataset, which can make training slow and memory-intensive (the "Curse of Dimensionality").
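You can check the count directly: with n input features and degree d, PolynomialFeatures produces C(n + d, d) columns (including the bias term).

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

n_features = 10
counts = {}
for degree in (2, 3, 4):
    poly = PolynomialFeatures(degree=degree).fit(np.zeros((1, n_features)))
    counts[degree] = poly.n_output_features_
    # Matches the closed form C(n + d, d)
    print(f"degree {degree}: {counts[degree]} features "
          f"(formula: {comb(n_features + degree, degree)})")
```

Ten original features already become 1001 at degree 4—a concrete view of the feature explosion.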

Mitigation:

  • Apply polynomial features only to the features that actually show non-linearity.
  • Use interaction_only=True if cross-products are all you need.
  • Use a feature selection method (or Lasso) to prune useless terms.

3. Multicollinearity

A feature and its powers (e.g., x and x²) are naturally correlated. This is called multicollinearity. High multicollinearity can make the model's coefficient estimates unstable and difficult to interpret. A small change in the data could cause large swings in the coefficient values.

Mitigation:

  • Center/scale the data before creating the polynomial terms.
  • Use Ridge Regression, which handles correlated features gracefully.

VII. Python Implementation with Scikit-Learn

The best way to implement polynomial regression is by using a Pipeline. This ensures that your steps (scaling, transformation) are applied correctly and prevents data leakage from your test set.

Here is a complete, commented example.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline

# 1. Create some sample non-linear data
np.random.seed(42)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + np.random.randn(80) * 0.1

# 2. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Create a pipeline
# This is the key! It chains together the steps of our modeling process.
# We will compare a simple linear model to a polynomial one.

# Define the degree of the polynomial
degree = 4

# Create the pipeline object
# - Step 1: Scale the data
# - Step 2: Create polynomial features from the scaled data
# - Step 3: Fit a linear regression model on the new polynomial features
polynomial_regression = Pipeline([
    ("scaler", StandardScaler()),
    ("poly_features", PolynomialFeatures(degree=degree, include_bias=False)),
    ("linear_regression", LinearRegression())
])

# 4. Train the model
polynomial_regression.fit(X_train, y_train)

# 5. Evaluate the model
print(f"Train R^2: {polynomial_regression.score(X_train, y_train):.2f}")
print(f"Test R^2:  {polynomial_regression.score(X_test, y_test):.2f}")

# 6. Visualize the results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, label="Data points")

# Create a smooth line for plotting the model's prediction
X_plot = np.linspace(0, 5, 100).reshape(-1, 1)
y_plot = polynomial_regression.predict(X_plot)

plt.plot(X_plot, y_plot, color='red', linewidth=2, label=f"Polynomial Regression (degree {degree})")
plt.title("Polynomial Regression Fit")
plt.xlabel("Feature")
plt.ylabel("Target")
plt.legend()
plt.show()

Key PolynomialFeatures Parameters

  • degree: the maximum degree of the generated terms (default 2).
  • include_bias: whether to add a constant column of 1s (default True). Set it to False when your downstream model already fits an intercept.
  • interaction_only: if True, only interaction terms (e.g., x₁·x₂) are produced, with no pure powers like x₁² (default False).

VIII. Summary & Best Practices

Here are the key takeaways for using polynomial transformation effectively.

  • When to Use: When you see a non-linear pattern between a feature and the target, but you want to use a linear model (like Linear/Logistic Regression).
  • How to Check: Use scatter plots with a LOESS smoothing line or look for patterns in residual plots.
  • Choosing the Degree: Start with degree 2. Only increase to 3 if validation performance improves significantly. Avoid degrees 4+.
  • Scaling: Always scale your data if your model is sensitive to it (e.g., uses regularization or distances).
  • Order of Operations: The correct workflow is Scale → Transform. Use a Pipeline to enforce this.
  • Feature Selection: Apply polynomial features only to the specific features that exhibit non-linearity. Do not apply it to all features blindly.
  • Overfitting: To combat overfitting from higher-degree polynomials, always use a regularized model like Ridge or Lasso.
  • Interpretation: Polynomial features make models harder to interpret. Be prepared to explain relationships like "Price increases with the square of size."