Enhancing Boundary Stability in Decision Trees and Random Forests: A Weighted Sample Duplication Approach
Decision trees and their ensemble extensions, such as random forests, are widely used classification models due to their simplicity and interpretability. However, in many real-world tasks where class labels overlap in the feature space, standard decision trees rely on hard splits that create fragile decision boundaries. In these regions, small perturbations in the input values can lead to misclassification, reducing the reliability of the model. To address this issue, we propose a localized data-duplication mechanism that modifies the standard CART algorithm by duplicating samples located near the chosen split threshold into both child nodes. To prevent these duplicated samples from dominating the child nodes, they are assigned a reduced weight given by a smoothly decaying function of their distance from the threshold. This approach allows both child nodes to learn from ambiguous regions, preserving information about uncertainty while maintaining the axis-aligned, deterministic structure of classical decision trees. When applied within a random forest framework, the duplication process also increases ensemble diversity. Experimental evaluation on 11 real-world datasets with varying degrees of class overlap demonstrates that the proposed modification consistently improves ROC-AUC scores and boundary stability while keeping computational costs low.
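The core split-time mechanism described above can be sketched as follows. This is an illustrative sketch only: the Gaussian decay, the `bandwidth` parameter, and the specific weighting scheme (each side's weight summing to the sample's original unit weight) are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def weighted_duplicate_split(x, t, bandwidth):
    """Split 1-D feature values x at threshold t, duplicating near-threshold
    samples into both child nodes with smoothly decaying weights.

    Illustrative sketch: a Gaussian decay is assumed; the paper's exact
    decay function may differ.
    """
    d = np.abs(x - t)
    # Duplicate weight decays smoothly with distance from the threshold:
    # at the threshold w_dup -> 0.5 (sample shared almost equally between
    # children); far from it w_dup -> 0 (ordinary hard split).
    w_dup = 0.5 * np.exp(-(d / bandwidth) ** 2)

    left_idx, left_w, right_idx, right_w = [], [], [], []
    for i, xi in enumerate(x):
        if xi <= t:
            left_idx.append(i)
            left_w.append(1.0 - w_dup[i])
            if w_dup[i] > 1e-6:  # duplicate into the right child
                right_idx.append(i)
                right_w.append(w_dup[i])
        else:
            right_idx.append(i)
            right_w.append(1.0 - w_dup[i])
            if w_dup[i] > 1e-6:  # duplicate into the left child
                left_idx.append(i)
                left_w.append(w_dup[i])
    return (np.array(left_idx), np.array(left_w),
            np.array(right_idx), np.array(right_w))
```

Because each sample's left and right weights sum to its original unit weight, total sample mass is conserved across the split, so weighted impurity computations in the child nodes remain comparable to the standard CART case.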