Decision Trees Natively Process Categorical Data

Most machine learning algorithms cannot handle categorical variables directly. Decision trees (DTs), however, can. Classification trees also do not require numerical targets. Below is an illustration of a tree that divides the Cyrillic alphabet into vowels and consonants. It uses no numerical features at all, yet it is a perfectly valid tree.
Many practitioners also use mean target encoding (MTE) as a clever way to convert categorical data into numerical form without inflating the feature space the way one-hot encoding does. However, I have rarely seen the inherent connection between MTE and decision tree logic spelled out. This article closes that gap through illustrative experiments. Specifically:
- I start with a quick review of how decision trees handle categorical features.
- We see how this becomes a computational challenge for high-cardinality features.
- I demonstrate how mean target encoding naturally emerges as a solution to this problem, unlike label encoding.
- You can reproduce my experiments using the code on GitHub.
Quick note: fans of mean target encoding often speak poorly of one-hot encoding, but it is not as bad as they suggest. In fact, in our benchmark experiments it often ranked first among the 32 categorical encoding methods we evaluated [1].
Decision Trees and the Curse of Categorical Features
Decision tree learning is a recursive algorithm. At each recursive step, it iterates over all features, looking for the best split. It is therefore sufficient to examine how a single recursive iteration handles a categorical feature. If you are unsure how this operation generalizes to the construction of the entire tree, see [2].
For a categorical feature, the algorithm evaluates all possible ways of partitioning the categories into two non-empty groups and selects the one that yields the best split quality. Quality is typically measured by Gini impurity for binary classification or by mean squared error for regression; for both, lower is better. See the pseudocode below.
# ---------- Gini impurity criterion ----------
FUNCTION GiniImpurityForSplit(split):
    left, right = split
    total = size(left) + size(right)
    RETURN (size(left)/total) * GiniOfGroup(left) +
           (size(right)/total) * GiniOfGroup(right)

FUNCTION GiniOfGroup(group):
    n = size(group)
    IF n == 0: RETURN 0
    ones = count(values equal 1 in group)
    zeros = n - ones
    p1 = ones / n
    p0 = zeros / n
    RETURN 1 - (p0² + p1²)
# ---------- Mean-squared-error criterion ----------
FUNCTION MSECriterionForSplit(split):
    left, right = split
    total = size(left) + size(right)
    IF total == 0: RETURN 0
    RETURN (size(left)/total) * MSEOfGroup(left) +
           (size(right)/total) * MSEOfGroup(right)

FUNCTION MSEOfGroup(group):
    n = size(group)
    IF n == 0: RETURN 0
    μ = mean(Value column of group)
    RETURN sum( (v − μ)² for each v in group ) / n
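For readers who prefer runnable code, here is a plain-Python sketch of the two criteria above. The function names mirror the pseudocode and are my own, not from any library.

```python
def gini_of_group(group):
    """Gini impurity of a list of 0/1 labels."""
    n = len(group)
    if n == 0:
        return 0.0
    p1 = sum(group) / n
    p0 = 1.0 - p1
    return 1.0 - (p0 ** 2 + p1 ** 2)

def gini_impurity_for_split(left, right):
    """Size-weighted Gini impurity of a binary split."""
    total = len(left) + len(right)
    return (len(left) / total) * gini_of_group(left) + \
           (len(right) / total) * gini_of_group(right)

def mse_of_group(group):
    """Mean squared error around the group mean."""
    n = len(group)
    if n == 0:
        return 0.0
    mu = sum(group) / n
    return sum((v - mu) ** 2 for v in group) / n

def mse_criterion_for_split(left, right):
    """Size-weighted MSE of a binary split."""
    total = len(left) + len(right)
    if total == 0:
        return 0.0
    return (len(left) / total) * mse_of_group(left) + \
           (len(right) / total) * mse_of_group(right)

# A pure split has zero impurity; a perfectly mixed one does not.
print(gini_impurity_for_split([1, 1], [0, 0]))  # 0.0
print(gini_impurity_for_split([1, 0], [1, 0]))  # 0.5
```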
Assume the feature has cardinality k. Each category can belong to either of the two groups, giving 2ᵏ total assignments. Excluding the two trivial cases where one of the sets is empty leaves 2ᵏ − 2 feasible splits. Next, note that we do not care about the order of the sets: splits like {{a, b}, {c}} and {{c}, {a, b}} are equivalent. This halves the number of unique combinations, for a final count of (2ᵏ − 2)/2 iterations. For our toy example above with k = 5 Cyrillic letters, that number is 15; but at k = 20 it balloons to 524,287 combinations, enough to slow down DT training significantly.
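To make the counting argument concrete, here is a small Python sketch (the helper `all_category_splits` is my own illustration, not the article's repository code) that enumerates the unordered splits and reproduces the counts above:

```python
from itertools import combinations

def all_category_splits(categories):
    """Yield every unordered split of `categories` into two non-empty groups."""
    cats = list(categories)
    n = len(cats)
    rest = cats[1:]
    # Fix the first category in the left group so each unordered
    # split is generated exactly once (no mirrored duplicates).
    for r in range(0, n - 1):
        for combo in combinations(rest, r):
            left = {cats[0], *combo}
            right = set(cats) - left
            yield left, right

splits = list(all_category_splits(['а', 'е', 'б', 'в', 'г']))
print(len(splits))         # 15, i.e. (2**5 - 2) // 2
print((2 ** 20 - 2) // 2)  # 524287 at k = 20
```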
Mean Target Encoding Solves the Efficiency Problem
What if one could shrink the search space from (2ᵏ − 2)/2 to something more manageable, without losing the optimal split? It turns out this is indeed possible. It can be shown theoretically that mean target encoding achieves exactly that [3]. Specifically, if the categories are sorted by their MTE values and only splits respecting that order are considered, the optimal split (by squared error for regression or Gini impurity for binary classification) is guaranteed to be among them. There are exactly k − 1 such splits, compared with (2ᵏ − 2)/2. The pseudocode for MTE is below.
# ---------- Mean-target encoding ----------
FUNCTION MeanTargetEncode(table):
    category_means = average(Value) for each Category in table   # Category → mean(Value)
    encoded_column = lookup(table.Category, category_means)      # replace label with mean
    RETURN encoded_column
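A plain-Python sketch of the same encoder, assuming the data arrives as parallel category/target lists (the function name mirrors the pseudocode, not any library API):

```python
from collections import defaultdict

def mean_target_encode(categories, values):
    """Replace each category label with the mean target of that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, v in zip(categories, values):
        sums[c] += v
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories]

print(mean_target_encode(['a', 'a', 'b', 'b', 'c'], [1, 0, 1, 1, 0]))
# [0.5, 0.5, 1.0, 1.0, 0.0]
```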
Experiment
I will not repeat the theoretical derivation supporting the claims above. Instead, I designed an experiment to verify them empirically and to quantify the efficiency gains MTE brings over the naive approach that exhaustively iterates over all possible splits. Below, I explain the data generation process and the experimental setup.
Data
# ---------- Synthetic-dataset generator ----------
FUNCTION GenerateData(num_categories, rows_per_cat, target_type='binary'):
    total_rows = num_categories * rows_per_cat
    categories = ['Category_' + i for i in 1..num_categories]
    category_col = repeat_each(categories, rows_per_cat)
    IF target_type == 'continuous':
        target_col = random_floats(0, 1, total_rows)
    ELSE:
        target_col = random_ints(0, 1, total_rows)
    RETURN DataFrame{ 'Category': category_col,
                      'Value'   : target_col }
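A runnable Python counterpart of the generator, using only the standard library and returning a plain dict of columns rather than a DataFrame (a sketch; the function and column names follow the pseudocode):

```python
import random

def generate_data(num_categories, rows_per_cat, target_type='binary'):
    """Synthetic dataset: each category appears rows_per_cat times."""
    categories = [f'Category_{i}' for i in range(1, num_categories + 1)]
    category_col = [c for c in categories for _ in range(rows_per_cat)]
    if target_type == 'continuous':
        target_col = [random.random() for _ in category_col]
    else:
        target_col = [random.randint(0, 1) for _ in category_col]
    return {'Category': category_col, 'Value': target_col}

data = generate_data(3, 2)
print(data['Category'][:3])  # ['Category_1', 'Category_1', 'Category_2']
```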
Experimental setup
The experiment function takes a list of cardinalities and a splitting criterion, Gini impurity or squared error depending on the target type. For each cardinality in the list, it generates 100 datasets and compares two strategies: an exhaustive evaluation of all possible category splits, and the restricted evaluation of MTE-ordered splits. It measures the runtime of each method and checks whether both produce the same optimal split score. The function returns the number of matching cases and the average runtimes. The pseudocode is given below.
# ---------- Split comparison experiment ----------
FUNCTION RunExperiment(list_num_categories, splitting_criterion):
    results = []
    FOR k IN list_num_categories:
        times_all = []
        times_ord = []
        REPEAT 100 times:
            df = GenerateData(k, 100)
            t0 = now()
            s_all = MinScore(df, AllSplits, splitting_criterion)
            t1 = now()
            t2 = now()
            s_ord = MinScore(df, MTEOrderedSplits, splitting_criterion)
            t3 = now()
            times_all.append(t1 - t0)
            times_ord.append(t3 - t2)
            IF round(s_all, 10) != round(s_ord, 10):
                PRINT "Discrepancy at k=", k
        results.append({
            'k': k,
            'avg_time_all': mean(times_all),
            'avg_time_ord': mean(times_ord)
        })
    RETURN DataFrame(results)
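As a self-contained sanity check of the core claim, the sketch below (with hypothetical helpers `best_exhaustive` and `best_mte_ordered`, my own code rather than the repository's) compares the best Gini score over all splits with the best over the k − 1 MTE-ordered splits on a small random dataset:

```python
import random
from itertools import combinations

def gini_split(left, right):
    """Size-weighted Gini impurity of a binary split of 0/1 labels."""
    def gini(g):
        if not g:
            return 0.0
        p1 = sum(g) / len(g)
        return 1.0 - (p1 ** 2 + (1 - p1) ** 2)
    total = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / total

def best_exhaustive(groups):
    """Best score over all (2^k - 2)/2 unordered category splits."""
    cats = list(groups)
    best = float('inf')
    for r in range(1, len(cats)):
        # Fix cats[0] in the left group to avoid mirrored duplicates.
        for combo in combinations(cats[1:], r - 1):
            left_cats = {cats[0], *combo}
            left = [y for c in left_cats for y in groups[c]]
            right = [y for c in groups if c not in left_cats for y in groups[c]]
            best = min(best, gini_split(left, right))
    return best

def best_mte_ordered(groups):
    """Best score over only the k - 1 splits respecting the MTE order."""
    order = sorted(groups, key=lambda c: sum(groups[c]) / len(groups[c]))
    best = float('inf')
    for i in range(1, len(order)):
        left = [y for c in order[:i] for y in groups[c]]
        right = [y for c in order[i:] for y in groups[c]]
        best = min(best, gini_split(left, right))
    return best

random.seed(0)
groups = {c: [random.randint(0, 1) for _ in range(20)] for c in 'abcdef'}
print(round(best_exhaustive(groups), 10) == round(best_mte_ordered(groups), 10))
# True
```

Both searches return the same minimum, as the theory in [3] guarantees, while the ordered search evaluates only k − 1 candidates.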
Results
You can take my word for it, or rerun the experiments (GitHub), but as the theory predicts, the best split scores of the two methods always match. The figure below shows the time it takes to evaluate the splits as a function of the number of categories. The vertical axis is on a logarithmic scale. The line for the exhaustive evaluation appears linear in these coordinates, which means its runtime grows exponentially with the number of categories, confirming the theoretical complexity discussed earlier. Already at 12 categories (on a dataset with 1,200 rows), checking all possible splits takes about one second, three orders of magnitude slower than the MTE-based method, which yields the same optimal split.

Conclusion
Decision trees can process categorical data natively, but this ability comes at a computational cost as the category count grows. Mean target encoding provides a principled shortcut: it greatly reduces the number of candidate splits without compromising the result. Our experiments confirm the theory: MTE-based ordering finds the same optimal split, just much faster.
At the time of writing, scikit-learn does not directly support categorical features in decision trees. So here is something to think about: if you preprocess the data with MTE, will the resulting decision tree match the one built by a learner that handles categorical features natively?
References
[1] A benchmark and taxonomy of categorical encoders. Towards Data Science.
[2] Mining rules from data. Towards Data Science.
[3] Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Vol. 2. New York: Springer, 2009.