In recent years, artificial intelligence has increasingly permeated society, including applications in high-stakes domains that may have a significant and even harmful impact on people. For these domains, it is essential to have comprehensible and reliable models. However, many o
...
In recent years, artificial intelligence has increasingly permeated society, including applications in high-stakes domains that may have a significant and even harmful impact on people. For these domains, it is essential to have comprehensible and reliable models. However, many of the most commonly used models, such as neural networks, are black-box models, i.e., inherently opaque and thus hard to trust.
Therefore, this dissertation explores decision trees as a human-interpretable alternative. Specifically, we focus on optimal decision trees, since the traditional greedy heuristics often learn decision trees that are too large to interpret easily, whereas optimal decision trees provably optimize the objective of a learning problem (e.g., accuracy) given a constraint on model size. The compactness and transparency of these optimal models contribute to their interpretability, while in many cases they achieve accuracy comparable to that of much more complex black-box models. In addition, with many optimal learning methods it is easier to specify a learning task other than the default of maximizing classification accuracy.
However, since learning optimal decision trees is an NP-hard problem, scalability is a major concern. State-of-the-art model-based approaches, such as mixed-integer programming and Boolean satisfiability, often do not scale beyond a few hundred samples and shallow depth limits. On the other hand, search-based approaches, such as dynamic programming, scale much better by exploiting the separable tree structure using specialized algorithms to solve subtrees as independent (and repeated) subproblems. But these methods are mostly limited to classification, while their scalability depends on a coarse binarization of the numerical data.
Therefore, in this thesis, we study optimal decision trees across four challenges: (i) efficiently optimizing learning tasks other than classification, (ii) improving our understanding of the differences between optimal and traditional greedy decision trees, (iii) efficiently handling data with numerical attributes; and (iv) efficiently finding the whole Rashomon set: the set of all (near-)optimal decision trees.
In the first challenge we address improving scalability for optimizing decision trees for a variety of learning tasks other than classification. We build on existing dynamic programming approaches for their scalability, and because this scalability depends on the property that subproblems are separable, we study and formalize the precise necessary and sufficient conditions for separability. These conditions turn out to be broader than previously assumed, and we use these new insights to improve the scalability of optimal decision trees for a variety of learning tasks, such as classification under fairness constraints, survival analysis, regression, prescriptive policy learning, and cost-sensitive classification. We also show how different optimization techniques, such as branch-and-bound and a specialized subroutine for depth-two trees, can be adapted and applied across several of these tasks. Together, these techniques improve scalability by one or more orders of magnitude over the state of the art for multiple learning tasks.
The second challenge is to improve our understanding of the differences between optimal and greedy heuristic methods for learning decision trees because previous comparisons were severely limited by lack of scalability. We empirically study the effect of the loss function and hyperparameter tuning of optimal decision trees and conclude that the heuristic and optimal approach are inherently different, for example, concave loss functions, which are required for many heuristics, do not work well for optimal methods. Previous comparisons also resulted in divergent and contradicting claims about optimal methods, and therefore, we empirically test six such claims from the literature. We confirm the most important one: when measuring both size (as a proxy of interpretability) and accuracy, optimal decision trees clearly dominate trees learned by greedy heuristics. We also refute several claims, including the claim that optimal methods are more prone to overfitting, and the claim that the differences between the two methods diminish with more data. For both claims, we find evidence suggesting the opposite.
The third challenge is to improve scalability when the data contains numerical attributes, without falling back to a coarse binarization as most dynamic programming approaches do. Therefore, we develop new algorithmic techniques that exploit the properties of numerical attributes. We show how reasoning about subproblem similarity makes it possible to prove that large parts of the search space can be pruned, while the ordering among numeric values makes it possible to compute aspects of the problem incrementally. Together, these techniques result in scalability improvements of one to two orders of magnitude compared to the state of the art.
For the fourth challenge, we go beyond learning a single optimal model. Instead, we focus on the set of all decision trees within a small percentage of the optimal solution, the so-called Rashomon set. The Rashomon set can be useful for data analysis, for example in determining which attributes have the greatest impact on predictions, or in finding better counterfactual explanations. Our contribution is a more efficient, anytime, sorted enumeration of all decision trees in the Rashomon set using lazy evaluation, as well as adapting and applying existing algorithmic techniques for optimal decision trees to compute the Rashomon set. Here too, our method improves scalability by one or more orders of magnitude compared to the previous best approach.
In summary, this dissertation improves scalability by several orders of magnitude, broadens the range of possible applications, and deepens both the theoretical and empirical understanding of optimal decision trees and decision tree Rashomon sets. Therefore, we can now recommend the use of optimal decision trees as human interpretable models for a large variety of real-world data science problems.