Gini Index

👉 Gini Index=Gini Coefficient=Gini Ratio=Gini Impurity
👉 The Gini Index is designed for categorical target variables

The Gini index, quantifies how messy or clean a dataset is when you use decision trees for classification.
It ranges from 0 (the cleanest, where all data points have the same label) to 1 (the messiest, where data points are evenly split among all labels). Essentially, it measures the probability of a randomly chosen instance being misclassified by the decision tree algorithm.

Significance in Decision Trees

1. Acting as a Splitting Criterion:

Decision trees use the Gini index as a splitting criterion to determine how to divide the dataset into branches. The goal is to minimize impurity (the Gini index) at each node, leading to more accurate predictions.

2. Choosing the Best Split:

When constructing the decision tree, the algorithm evaluates different features (attributes) to decide which one should be placed at the root node. The significance of the Gini Index is that the tree will always choose the split that results in the lowest possible Gini score. The feature with the lowest Gini index after the split becomes the root node. Subsequent splits continue this process, optimizing the Gini index at each level.

🤔 - If the loss function is the compass that guides the entire model (Decision Tree, Random Forest...), the Gini Index is the compass that guides each split in process.

Formula for the Gini Index

The Gini index for a given node can be calculated as follows:

Gini Index=1i=1k=pi2

where (k) is the number of classes (labels), and (pi) represents the proportion of instances belonging to class (i) in the node.

Example of a Gini Index Calculation

Chest
Pain
Blocked
Arteries
Patient
Weight
Heart
Disease
Yes Yes 205 Yes
No Yes 180 Yes
Yes No 210 Yes
Yes Yes 167 Yes
No Yes 156 No
No Yes 125 No
Yes No 168 No
Yes Yes 172 No

Our goal is to find the split that results in the lowest Gini Index (the purest buckets of data).
As we have 2 classes "yes" and "no", to find the Gini impurity of any single bucket, we calculate:

Gini Index=1(Pyes2+Pno2)

Where

Here is the step-by-step breakdown for each feature.

1. Chest Pain

If we split the data based on whether the patient has Chest Pain, we create two buckets:

Bucket 1: Chest Pain = Yes (5 patients total)

Gini=1((35)2+(25)2)=1(0.36+0.16)=0.48

Bucket 2: Chest Pain = No (3 patients total)

Gini=1((13)2+(23)2)1(0.111+0.444)0.444

Weighted Average for Chest Pain:
To get the final score for the feature, we weigh each bucket by how many patients went into it (5/8 and 3/8):

Weighted Gini=(58×0.48)+(38×0.444)0.467

2. Blocked Arteries

If we split the data based on Blocked Arteries, we create two new buckets:

Bucket 1: Blocked Arteries = Yes (6 patients total)

Gini=1((36)2+(36)2)=1(0.25+0.25)=0.5

Bucket 2: Blocked Arteries = No (2 patients total)

Gini=1((12)2+(12)2)=1(0.25+0.25)=0.5

Weighted Average for Blocked Arteries:

Since both buckets are perfectly mixed (a 50/50 toss-up), the weighted Gini is the maximum possible impurity.

Weighted Gini =(68×0.5)+(28×0.5)=0.5

3. Patient Weight (Continuous Data)

Because weight is a number, not a "Yes/No" category, the tree has to work a little harder. It sorts the patients by weight from lowest to highest, tests every single midpoint between the weights as a potential split, and picks the best one.

If we test all the midpoints, the optimal split happens at 176 (the midpoint between patient 172 and patient 180). Let's look at the buckets this creates:

Bucket 1: Weight 176 (5 patients total)

Gini=1((15)2+(45)2)=1(0.04+0.64)=0.32

Bucket 2: Weight > 176 (3 patients total)

Gini=1((33)2+(03)2)=1(1+0)=0

Weighted Average for Patient Weight:

Weighted Gini=(58×0.32)+(38×0)=0.2

Final Decision: How Will It Choose to Split?

Let's look at our final scoreboard:

Because Patient Weight has the lowest Gini Index by a massive margin, the decision tree will choose to split the dataset on Patient Weight 176 for its very first root node. It successfully separated a perfectly pure group of patients with Heart Disease right out of the gate!


Gini Index for "continuous" data

As we roughly mentioned in above example's 3th point (for Patient's Weight) — test all the midpoints , now if you have a dataset with 10 rows, testing every single midpoint between weights takes a fraction of a second. But if you have 10 million rows, testing 9,999,999 midpoints for every single feature, at every single level of the tree, will cause your computer to freeze up and take days to run.

Older algorithms like CART decision trees did try to sort all the data and test every exact midpoint. This is called the Exact Greedy Algorithm, and as you guessed, it scales terribly.

To solve this scaling problem for large significantly sized dataset scikit-learn use advanced optimization shortcuts to pick split points efficiently.

1. Sorting and Eliminating Duplicates

The algorithm first sorts the data in ascending order. If there are duplicate values (e.g., thousands of rows where age is exactly 25), it filters them out. Midpoints are only calculated between consecutive unique values, which instantly reduces the number of candidate split points.

2. Skipping Consecutive Identical Classes

Even with unique values, testing every midpoint is wasteful if the target class doesn't change. The algorithm scans the sorted data and only considers midpoints where the target label changes.

3. Cumulative Histograms (Binning)

For massive datasets (like millions of rows), frameworks like LightGBM or XGBoost use a histogram-based approach.

4. Quantile Sketching

If the data cannot fit into memory, the system uses algorithms to find approximate quantiles. It distributes the data, finds where the percentiles (like the 1st, 2nd, 3rd... up to 99th percentile) lie, and uses those exact percentile marks as the midpoints.

Which Algorithm uses these 4 approach?

The four approaches are not all part of a single algorithm. Instead, different machine learning frameworks combine specific techniques depending on their design goals (maximizing accuracy vs. maximizing speed).

Here is how the main machine learning algorithms pick and mix these approaches:

I. Scikit-learn (Standard CART Algorithm)

Scikit-learn’s standard DecisionTreeClassifier focuses on exact math. It is designed for datasets that fit completely in your computer's RAM.

2. LightGBM & Hist-Gradient Boosting

Microsoft developed LightGBM specifically to solve the speed bottlenecks of large datasets. Scikit-learn also copied this method in its newer HistGradientBoostingClassifier.

3. XGBoost (Extreme Gradient Boosting)

XGBoost is built to handle massive, distributed datasets that might be too large for a single computer's memory.