Gini Index

👉 $Gini Index = Gini Coefficient = Gini Ratio = Gini Impurity$
👉 The Gini Index is designed for categorical target variables

The Gini index, quantifies how messy or clean a dataset is when you use decision trees for classification.
It ranges from 0 (the cleanest, where all data points have the same label) to 1 (the messiest, where data points are evenly split among all labels). Essentially, it measures the probability of a randomly chosen instance being misclassified by the decision tree algorithm.

Significance in Decision Trees

1. Acting as a Splitting Criterion:

Decision trees use the Gini index as a splitting criterion to determine how to divide the dataset into branches. The goal is to minimize impurity (the Gini index) at each node, leading to more accurate predictions.

2. Choosing the Best Split:

When constructing the decision tree, the algorithm evaluates different features (attributes) to decide which one should be placed at the root node. The significance of the Gini Index is that the tree will always choose the split that results in the lowest possible Gini score. The feature with the lowest Gini index after the split becomes the root node. Subsequent splits continue this process, optimizing the Gini index at each level.

🤔 - If the loss function is the compass that guides the entire model (Decision Tree, Random Forest...), the Gini Index is the compass that guides each split in process.

Formula for the Gini Index

The Gini index for a given node can be calculated as follows:

Gini Index = 1 - \sum_{i = 1}^{k} = p_{i}^{2}

where (k) is the number of classes (labels), and ( $p_{i}$ ) represents the proportion of instances belonging to class (i) in the node.

Example of a Gini Index Calculation

Chest Pain	Blocked Arteries	Patient Weight	Heart Disease
Yes	Yes	205	Yes
No	Yes	180	Yes
Yes	No	210	Yes
Yes	Yes	167	Yes
No	Yes	156	No
No	Yes	125	No
Yes	No	168	No
Yes	Yes	172	No

Our goal is to find the split that results in the lowest Gini Index (the purest buckets of data).
As we have 2 classes "yes" and "no", to find the Gini impurity of any single bucket, we calculate:

Gini Index = 1 - (P_{y e s}^{2} + P_{n o}^{2})

Where

$P$ is the probability of picking a "Yes" or "No" for Heart Disease in that specific bucket.

Here is the step-by-step breakdown for each feature.

1. Chest Pain

If we split the data based on whether the patient has Chest Pain, we create two buckets:

Bucket 1: Chest Pain = Yes (5 patients total)

Target Mix: 3 Yes, 2 No

G i n i = 1 - ({(\frac{3}{5})}^{2} + {(\frac{2}{5})}^{2}) = 1 - (0.36 + 0.16) = 0.48

Bucket 2: Chest Pain = No (3 patients total)

Target Mix: 1 Yes, 2 No

G i n i = 1 - ({(\frac{1}{3})}^{2} + {(\frac{2}{3})}^{2}) \approx 1 - (0.111 + 0.444) \approx 0.444

Weighted Average for Chest Pain:
To get the final score for the feature, we weigh each bucket by how many patients went into it (5/8 and 3/8):

Weighted Gini = (\frac{5}{8} \times 0.48) + (\frac{3}{8} \times 0.444) \approx 0.467

2. Blocked Arteries

If we split the data based on Blocked Arteries, we create two new buckets:

Bucket 1: Blocked Arteries = Yes (6 patients total)

Target Mix: 3 Yes, 3 No

G i n i = 1 - ({(\frac{3}{6})}^{2} + {(\frac{3}{6})}^{2}) = 1 - (0.25 + 0.25) = 0.5

Bucket 2: Blocked Arteries = No (2 patients total)

Target Mix: 1 Yes, 1 No

G i n i = 1 - ({(\frac{1}{2})}^{2} + {(\frac{1}{2})}^{2}) = 1 - (0.25 + 0.25) = 0.5

Weighted Average for Blocked Arteries:

Since both buckets are perfectly mixed (a 50/50 toss-up), the weighted Gini is the maximum possible impurity.

Weighted Gini = (\frac{6}{8} \times 0.5) + (\frac{2}{8} \times 0.5) = 0.5

3. Patient Weight (Continuous Data)

Because weight is a number, not a "Yes/No" category, the tree has to work a little harder. It sorts the patients by weight from lowest to highest, tests every single midpoint between the weights as a potential split, and picks the best one.

If we test all the midpoints, the optimal split happens at 176 (the midpoint between patient 172 and patient 180). Let's look at the buckets this creates:

Bucket 1: Weight $\leq$ 176 (5 patients total)

Target Mix: 1 Yes, 4 No

G i n i = 1 - ({(\frac{1}{5})}^{2} + {(\frac{4}{5})}^{2}) = 1 - (0.04 + 0.64) = 0.32

Bucket 2: Weight > 176 (3 patients total)

Target Mix: 3 Yes, 0 No (Notice this is a perfectly pure bucket!)

G i n i = 1 - ({(\frac{3}{3})}^{2} + {(\frac{0}{3})}^{2}) = 1 - (1 + 0) = 0

Weighted Average for Patient Weight:

Weighted Gini = (\frac{5}{8} \times 0.32) + (\frac{3}{8} \times 0) = 0.2

Final Decision: How Will It Choose to Split?

Let's look at our final scoreboard:

Blocked Arteries: 0.500
Chest Pain: 0.467
Patient Weight ( $\leq$ 176): 0.200

Because Patient Weight has the lowest Gini Index by a massive margin, the decision tree will choose to split the dataset on Patient Weight $\leq$ 176 for its very first root node. It successfully separated a perfectly pure group of patients with Heart Disease right out of the gate!

Gini Index for "continuous" data

As we roughly mentioned in above example's 3th point (for Patient's Weight) — test all the midpoints , now if you have a dataset with 10 rows, testing every single midpoint between weights takes a fraction of a second. But if you have 10 million rows, testing 9,999,999 midpoints for every single feature, at every single level of the tree, will cause your computer to freeze up and take days to run.

Older algorithms like CART decision trees did try to sort all the data and test every exact midpoint. This is called the Exact Greedy Algorithm, and as you guessed, it scales terribly.

To solve this scaling problem for large significantly sized dataset scikit-learn use advanced optimization shortcuts to pick split points efficiently.

1. Sorting and Eliminating Duplicates

The algorithm first sorts the data in ascending order. If there are duplicate values (e.g., thousands of rows where age is exactly 25), it filters them out. Midpoints are only calculated between consecutive unique values, which instantly reduces the number of candidate split points.

2. Skipping Consecutive Identical Classes

Even with unique values, testing every midpoint is wasteful if the target class doesn't change. The algorithm scans the sorted data and only considers midpoints where the target label changes.

Example: If values 10.1, 10.2, and 10.3 all belong to "Class A", the algorithm will skip testing splits between them. It will only test a midpoint when "Class A" switches to "Class B".

3. Cumulative Histograms (Binning)

For massive datasets (like millions of rows), frameworks like LightGBM or XGBoost use a histogram-based approach.

Instead of looking at individual values, they bucket the continuous data into a fixed number of bins (usually 256 bins).
The boundaries of these bins become the only candidate split points.
This reduces the number of midpoints to evaluate from millions down to a maximum of 255 options, drastically speeding up the process.

4. Quantile Sketching

If the data cannot fit into memory, the system uses algorithms to find approximate quantiles. It distributes the data, finds where the percentiles (like the 1st, 2nd, 3rd... up to 99th percentile) lie, and uses those exact percentile marks as the midpoints.

Which Algorithm uses these 4 approach?

The four approaches are not all part of a single algorithm. Instead, different machine learning frameworks combine specific techniques depending on their design goals (maximizing accuracy vs. maximizing speed).

Here is how the main machine learning algorithms pick and mix these approaches:

I. Scikit-learn (Standard CART Algorithm)

Scikit-learn’s standard DecisionTreeClassifier focuses on exact math. It is designed for datasets that fit completely in your computer's RAM.

It uses Approach 1 (Sorting & unique values) + Approach 2 (Skipping identical classes).
It evaluates the exact mathematical best split point but can become slow when datasets grow to tens of millions of rows.

2. LightGBM & Hist-Gradient Boosting

Microsoft developed LightGBM specifically to solve the speed bottlenecks of large datasets. Scikit-learn also copied this method in its newer HistGradientBoostingClassifier.

It skips standard sorting and relies entirely on Approach 3 (Binning continuous data into histograms).
It reduces millions of potential midpoints down to just 255 candidates, making it incredibly fast.

3. XGBoost (Extreme Gradient Boosting)

XGBoost is built to handle massive, distributed datasets that might be too large for a single computer's memory.

It uses Approach 3 (Binning) combined with Approach 4 (Quantile Sketching).
The quantile sketch safely estimates where the data bins should be placed across multiple servers without needing to sort the entire multi-gigabyte dataset in one place.