Gini Index
👉
👉 The Gini Index is designed for categorical target variables
The Gini index, quantifies how messy or clean a dataset is when you use decision trees for classification.
It ranges from 0 (the cleanest, where all data points have the same label) to 1 (the messiest, where data points are evenly split among all labels). Essentially, it measures the probability of a randomly chosen instance being misclassified by the decision tree algorithm.
Significance in Decision Trees
1. Acting as a Splitting Criterion:
Decision trees use the Gini index as a splitting criterion to determine how to divide the dataset into branches. The goal is to minimize impurity (the Gini index) at each node, leading to more accurate predictions.
2. Choosing the Best Split:
When constructing the decision tree, the algorithm evaluates different features (attributes) to decide which one should be placed at the root node. The significance of the Gini Index is that the tree will always choose the split that results in the lowest possible Gini score. The feature with the lowest Gini index after the split becomes the root node. Subsequent splits continue this process, optimizing the Gini index at each level.
🤔 - If the loss function is the compass that guides the entire model (Decision Tree, Random Forest...), the Gini Index is the compass that guides each split in process.
Formula for the Gini Index
The Gini index for a given node can be calculated as follows:
where (k) is the number of classes (labels), and (
Example of a Gini Index Calculation
| Chest Pain |
Blocked Arteries |
Patient Weight |
Heart Disease |
|---|---|---|---|
| Yes | Yes | 205 | Yes |
| No | Yes | 180 | Yes |
| Yes | No | 210 | Yes |
| Yes | Yes | 167 | Yes |
| No | Yes | 156 | No |
| No | Yes | 125 | No |
| Yes | No | 168 | No |
| Yes | Yes | 172 | No |
Our goal is to find the split that results in the lowest Gini Index (the purest buckets of data).
As we have 2 classes "yes" and "no", to find the Gini impurity of any single bucket, we calculate:
Where
is the probability of picking a "Yes" or "No" for Heart Disease in that specific bucket.
Here is the step-by-step breakdown for each feature.
1. Chest Pain
If we split the data based on whether the patient has Chest Pain, we create two buckets:
Bucket 1: Chest Pain = Yes (5 patients total)
- Target Mix: 3 Yes, 2 No
Bucket 2: Chest Pain = No (3 patients total)
- Target Mix: 1 Yes, 2 No
Weighted Average for Chest Pain:
To get the final score for the feature, we weigh each bucket by how many patients went into it (5/8 and 3/8):
2. Blocked Arteries
If we split the data based on Blocked Arteries, we create two new buckets:
Bucket 1: Blocked Arteries = Yes (6 patients total)
- Target Mix: 3 Yes, 3 No
Bucket 2: Blocked Arteries = No (2 patients total)
- Target Mix: 1 Yes, 1 No
Weighted Average for Blocked Arteries:
Since both buckets are perfectly mixed (a 50/50 toss-up), the weighted Gini is the maximum possible impurity.
3. Patient Weight (Continuous Data)
Because weight is a number, not a "Yes/No" category, the tree has to work a little harder. It sorts the patients by weight from lowest to highest, tests every single midpoint between the weights as a potential split, and picks the best one.
If we test all the midpoints, the optimal split happens at 176 (the midpoint between patient 172 and patient 180). Let's look at the buckets this creates:
Bucket 1: Weight
- Target Mix: 1 Yes, 4 No
Bucket 2: Weight > 176 (3 patients total)
- Target Mix: 3 Yes, 0 No (Notice this is a perfectly pure bucket!)
Weighted Average for Patient Weight:
Final Decision: How Will It Choose to Split?
Let's look at our final scoreboard:
- Blocked Arteries: 0.500
- Chest Pain: 0.467
- Patient Weight (
176): 0.200
Because Patient Weight has the lowest Gini Index by a massive margin, the decision tree will choose to split the dataset on Patient Weight
Gini Index for "continuous" data
As we roughly mentioned in above example's 3th point (for Patient's Weight) — test all the midpoints , now if you have a dataset with 10 rows, testing every single midpoint between weights takes a fraction of a second. But if you have 10 million rows, testing 9,999,999 midpoints for every single feature, at every single level of the tree, will cause your computer to freeze up and take days to run.
Older algorithms like CART decision trees did try to sort all the data and test every exact midpoint. This is called the Exact Greedy Algorithm, and as you guessed, it scales terribly.
To solve this scaling problem for large significantly sized dataset scikit-learn use advanced optimization shortcuts to pick split points efficiently.
1. Sorting and Eliminating Duplicates
The algorithm first sorts the data in ascending order. If there are duplicate values (e.g., thousands of rows where age is exactly 25), it filters them out. Midpoints are only calculated between consecutive unique values, which instantly reduces the number of candidate split points.
2. Skipping Consecutive Identical Classes
Even with unique values, testing every midpoint is wasteful if the target class doesn't change. The algorithm scans the sorted data and only considers midpoints where the target label changes.
- Example: If values
10.1,10.2, and10.3all belong to "Class A", the algorithm will skip testing splits between them. It will only test a midpoint when "Class A" switches to "Class B".
3. Cumulative Histograms (Binning)
For massive datasets (like millions of rows), frameworks like LightGBM or XGBoost use a histogram-based approach.
- Instead of looking at individual values, they bucket the continuous data into a fixed number of bins (usually 256 bins).
- The boundaries of these bins become the only candidate split points.
- This reduces the number of midpoints to evaluate from millions down to a maximum of 255 options, drastically speeding up the process.
4. Quantile Sketching
If the data cannot fit into memory, the system uses algorithms to find approximate quantiles. It distributes the data, finds where the percentiles (like the 1st, 2nd, 3rd... up to 99th percentile) lie, and uses those exact percentile marks as the midpoints.
Which Algorithm uses these 4 approach?
The four approaches are not all part of a single algorithm. Instead, different machine learning frameworks combine specific techniques depending on their design goals (maximizing accuracy vs. maximizing speed).
Here is how the main machine learning algorithms pick and mix these approaches:
I. Scikit-learn (Standard CART Algorithm)
Scikit-learn’s standard DecisionTreeClassifier focuses on exact math. It is designed for datasets that fit completely in your computer's RAM.
- It uses Approach 1 (Sorting & unique values) + Approach 2 (Skipping identical classes).
- It evaluates the exact mathematical best split point but can become slow when datasets grow to tens of millions of rows.
2. LightGBM & Hist-Gradient Boosting
Microsoft developed LightGBM specifically to solve the speed bottlenecks of large datasets. Scikit-learn also copied this method in its newer HistGradientBoostingClassifier.
- It skips standard sorting and relies entirely on Approach 3 (Binning continuous data into histograms).
- It reduces millions of potential midpoints down to just 255 candidates, making it incredibly fast.
3. XGBoost (Extreme Gradient Boosting)
XGBoost is built to handle massive, distributed datasets that might be too large for a single computer's memory.
- It uses Approach 3 (Binning) combined with Approach 4 (Quantile Sketching).
- The quantile sketch safely estimates where the data bins should be placed across multiple servers without needing to sort the entire multi-gigabyte dataset in one place.