Heterogeneous Forests of Decision Trees
Krzysztof Grąbczewski & Włodzisław Duch
Department of Informatics, Nicholas Copernicus University, Torun, Poland.
http://www.phys.uni.torun.pl/kmk
Motivation
Different classification systems:
• Black-box systems (statistical/neural) lack comprehensibility.
• Fuzzy logic or rough sets usually lead to complicated systems that are not understandable.
• Crisp logical rules may be the best solution.
Advantages of logical rules:
• Comprehensibility (sometimes more important than the best accuracy).
• They can find the most important concepts of the problem.
• They explain classification results (very important, for instance, in medicine).
• If simple, they show the most important features.
Heterogeneous systems
Homogeneous systems: one type of “building blocks”, same type of decision borders. Ex: neural networks, SVMs, decision trees, kNNs ...
Committees combine many models together, but lead to complex models that are difficult to understand.
Discovering the simplest class structures (their inductive bias) requires heterogeneous adaptive systems (HAS).
Ockham's razor: simpler systems are better.
HAS examples:
• NN with many types of neuron transfer functions.
• k-NN with different distance functions.
• DT with different types of test criteria.
DT Forests
Problem with DT (also NN): they are not stable; small input changes lead to different tree (network) structures.
Heterogeneous Forests of Decision Trees: all simple trees may be interesting!
• An expert gets alternative problem descriptions.
• Solutions with different sensitivity and specificity for similar accuracy are generated.
Similarity-based HAS
Local distance functions optimized differently in different regions of feature space.
Weighted Minkowski distance functions (ex: α = 20) and other types of functions, including probabilistic functions, changing piecewise linear decision borders.
RBF networks with different transfer functions; LVQ with different local functions.
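The distance formula itself did not survive on this slide, so the following is only a minimal sketch of one common form of a weighted Minkowski distance. Placing the weights inside the sum, and the names `weighted_minkowski`, `weights` and `alpha`, are assumptions for illustration, not the authors' definition.

```python
def weighted_minkowski(x, y, weights, alpha):
    """Assumed form: D(x, y) = (sum_i w_i * |x_i - y_i|**alpha) ** (1 / alpha)."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    total = sum(w * abs(a - b) ** alpha for a, b, w in zip(x, y, weights))
    return total ** (1.0 / alpha)


def chebyshev(x, y):
    """L-infinity distance: the limit of the unweighted form as alpha grows."""
    return max(abs(a - b) for a, b in zip(x, y))


if __name__ == "__main__":
    x, y, w = (0.0, 0.0), (3.0, 4.0), (1.0, 1.0)
    for alpha in (1, 2, 20):
        print(alpha, weighted_minkowski(x, y, w, alpha))
    print("inf", chebyshev(x, y))
```

With a large exponent such as the value 20 used in the slide's example, the distance is already close to the L∞ value, which is why the corresponding decision borders look almost rectangular.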
HAS decision trees
Decision trees select the best feature/threshold value for univariate and multivariate trees. Decision borders: hyperplanes.
Introducing tests based on the Lα Minkowski metric:
• For L2, spherical decision borders are produced.
• For L∞, rectangular borders are produced.
Many choices, for example Fisher Linear Discrimination decision trees.
Separability criterion
Separate different classes as well as possible. Use both continuous and discrete attributes to generate class separation indices that are comparable across different features.
How?
• Splitting continuous attributes (automatic and context-dependent generation of linguistic variables).
• Choosing the best subsets of discrete feature values (by analysis of all subsets; due to the 2^N complexity, it is recommended to avoid discrete features with more than 10 values).
• Combining the best intervals and sets in a tree, which can be easily converted to a set of classification rules.
SSV HAS DT
Define left and right areas for test T with threshold (or subset) s.
Count how many pairs of vectors from different classes are separated and how many vectors from the same class are separated.
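The slide's formula is not preserved, so here is a hedged sketch of the counting it describes, for a single continuous feature. The weighting (factor 2 on separated pairs, a `min` penalty for same-class vectors that get split) follows the form usually quoted for the SSV criterion and should be read as an assumption; the function names are hypothetical.

```python
from collections import Counter


def ssv(values, labels, threshold):
    """Separability of a split: reward separated pairs of vectors from
    different classes, penalise vectors of one class split across both sides."""
    left = Counter(c for v, c in zip(values, labels) if v < threshold)
    right = Counter(c for v, c in zip(values, labels) if v >= threshold)
    n_right = sum(right.values())
    classes = set(labels)
    # Pairs (left vector of class c, right vector of any other class).
    separated_pairs = sum(left[c] * (n_right - right[c]) for c in classes)
    # Same-class vectors that ended up on both sides of the split.
    same_class_split = sum(min(left[c], right[c]) for c in classes)
    return 2 * separated_pairs - same_class_split


def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values, keep the best one.
    Needs at least two distinct values, otherwise max() raises ValueError."""
    ordered = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    return max(candidates, key=lambda s: ssv(values, labels, s))


if __name__ == "__main__":
    values = [1, 2, 3, 4]
    labels = ["a", "a", "b", "b"]
    print(best_threshold(values, labels), ssv(values, labels, 2.5))
```

Because the score is a plain count, it can be compared across features of different types, which is what the separability criterion asks for.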
SSV HAS algorithm
Compromise between complexity/flexibility:
• Use training vectors as references R.
• Calculate T_R(X) = D(X, R) for all data vectors, i.e. the distance matrix.
• Use T_R(X) as additional test conditions.
• Calculate SSV(s) for each condition and select the best split.
Different distance functions lead to different decision borders. Several distance functions are used simultaneously.
Example: 2000 points, noisy 10-D plane, rotated 45°, + half-sphere centered on the plane.
• Standard SSV tree: 44 rules, 99.7%
• HAS SSV tree (Euclidean): 15 rules, 99.9%
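The steps listed on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, Euclidean distance is only one possible choice, and `score` stands for any split criterion with the signature `score(column, labels, threshold)`, for example an SSV function.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def add_distance_tests(data, references, distance=euclidean):
    """Append T_R(X) = D(X, R) for every reference R to each vector X."""
    return [list(x) + [distance(x, r) for r in references] for x in data]


def best_test(table, labels, score):
    """Return (column index, threshold, score) of the best candidate split,
    scanning original features and distance columns alike."""
    best = None
    for j in range(len(table[0])):
        column = [row[j] for row in table]
        ordered = sorted(set(column))
        for a, b in zip(ordered, ordered[1:]):
            s = (a + b) / 2
            value = score(column, labels, s)
            if best is None or value > best[2]:
                best = (j, s, value)
    return best


if __name__ == "__main__":
    # Two points near the origin (class 0) surrounded by four points (class 1).
    data = [(0, 0), (0.1, 0), (1, 0), (0, 1), (-1, 0), (0, -1)]
    labels = [0, 0, 1, 1, 1, 1]
    table = add_distance_tests(data, references=[data[0]])

    # Toy criterion (an assumption): vectors placed on the "right" side.
    def toy_score(column, labels, s):
        return sum(1 for v, c in zip(column, labels) if (v < s) == (c == 0))

    print(best_test(table, labels, toy_score))
```

In this toy data no single axis-parallel threshold separates the classes, while one threshold on the distance column does, which mirrors the reduction in rule count reported for the half-sphere example.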
What to measure?
Overall accuracy is not always the most important thing.
Given a model M, the confusion matrix for a class + and all other classes is laid out with rows = true classes and columns = classes predicted by M.
Quantities derived from p(Ci|Cj)
Several quantities are used to evaluate classification models M created to distinguish the C+ class.
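The list of quantities is not preserved here, so the sketch below covers only the three that the result slides report alongside each tree: accuracy A, sensitivity S+ and specificity S-. The cost K is left out because its formula is not recoverable from this text. The function names are hypothetical, and the counts in the usage example are back-computed from the percentages reported for Ljubljana Tree 3 later in these slides (an assumption, not data taken from the source).

```python
def one_vs_rest_counts(true_labels, predicted_labels, positive):
    """Confusion counts for class `positive` against all other classes."""
    tp = fn = fp = tn = 0
    for t, p in zip(true_labels, predicted_labels):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def summary(tp, fn, fp, tn):
    """Accuracy, sensitivity (S+) and specificity (S-).
    Assumes both classes are present, otherwise a division by zero occurs."""
    n = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


if __name__ == "__main__":
    # 85 recurrence and 201 no-recurrence cases, 68 errors in total.
    for name, value in summary(tp=27, fn=58, fp=10, tn=191).items():
        print(f"{name}: {100 * value:.1f}%")
```

Two models with nearly the same accuracy can differ widely in sensitivity and specificity, which is the reason the forest keeps several alternative trees.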
SSV HAS Iris
Iris data: 3 classes, 50 samples/class.
SSV solution with the usual conditions (6 errors, 96%), or with distance tests using vectors from a given node only:
if petal length < 2.45 then class 1
if petal length > 2.45 and petal width < 1.65 then class 2
if petal length > 2.45 and petal width > 1.65 then class 3
SSV with Euclidean distance tests using all training vectors as reference (5 errors, 96.7%):
1. if petal length < 2.45 then class 1
2. if petal length > 2.45 and ||X-R15|| < 4.02 then class 2
3. if petal length > 2.45 and ||X-R15|| > 4.02 then class 3
||X-R15|| is the Euclidean distance to the vector R15.
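Both rule sets translate directly into code. The sketch below is illustrative only: the slides do not say how values exactly on a threshold are handled (here they fall into the "greater" branch), the coordinates of R15 are not given (so the reference vector is a parameter), and the feature order sepal length, sepal width, petal length, petal width is an assumption.

```python
import math


def classify_iris(petal_length, petal_width):
    """Rules with the usual threshold conditions."""
    if petal_length < 2.45:
        return 1
    if petal_width < 1.65:
        return 2
    return 3


def classify_iris_by_distance(x, r15, petal_length_index=2):
    """Rules with a Euclidean distance test against a reference vector.
    `r15` must be supplied by the caller; its values are not in the slides."""
    if x[petal_length_index] < 2.45:
        return 1
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, r15)))
    return 2 if distance < 4.02 else 3


if __name__ == "__main__":
    print(classify_iris(1.4, 0.2), classify_iris(4.5, 1.5), classify_iris(6.0, 2.5))
```

The second function has the same shape as the first, with one threshold on a feature replaced by one threshold on a distance, so the rule set stays equally short.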
SSV HAS Wisconsin
Wisconsin breast cancer dataset (UCI): 699 cases, 9 features (cell parameters, 1..10).
Classes: benign 458 (65.5%) & malignant 241 (34.5%).
Single rule gives the simplest known description of this data:
IF ||X-R303|| < 20.27 then malignant else benign
18 errors, A=97.4%, S+=97.9%, S-=96.9%, K=0.0672 (cost K for a=5).
Good prototype for malignant! Simple thresholds, that's what MDs like the most!
Best leave-one-out (L1O) accuracy 98.3% (FSM), best 10CV around 97.5% (Naïve Bayes + kernel, SVM).
C4.5 gives 94.7±2.0%.
SSV without distances: 96.4±2.1%.
Wisconsin results 1
Tree 1
R1 If F3 < 2.5 then benign
R2 If F6 < 2.5 and F5 < 3.5 then benign
R3 else malignant
A=95.6% (25 err. + 6 uncl.), S+=95.0%, S-=95.9%, K=0.104
Tree 2 (a = 5)
R1 If F2 < 2.5 then benign
R2 If F2 < 4.5 and F6 < 3.5 then benign
R3 else malignant
A=95.0% (33 err. + 2 uncl.), S+=90.5%, S-=97.4%, K=0.107
Wisconsin results 2
Tree 3
R1 If F3 < 2.5 then benign
R2 If F5 < 2.5 and F6 < 2.5 then benign
R3 else malignant
A=95.1% (34 err.), S+=95.0%, S-=95.2%, K=0.1156
Tree 4
R1 If F3 < 2.5 then benign
R2 If F2 < 2.5 and F5 < 2.5 then benign
R3 else malignant
A=95.1% (34 err.), S+=95.9%, S-=94.8%, K=0.1166
Breast cancer recurrence (Ljubljana)
286 cases, 201 no-recurrence-events (70.3%), 85 recurrence-events (29.7%).
9 attributes with 2-13 different values each.
Difficult and noisy data, from UCI.
Tree 1
R1 If deg-malig > 2.5 and inv-nodes > 2.5 then recurrence-events
R2 If deg-malig > 2.5 and inv-nodes < 2.5 and (tumor-size in [25-34] or tumor-size in [50-54]) then recurrence-events
R3 else no-recurrence-events
A=76.9% (66 err.), S+=47.1%, S-=89.6%, K=0.526
Breast cancer recurrence (Ljubljana)
Tree 2
R1 If breast = left and inv-nodes > 2.5 then recurrence-events
R2 else no-recurrence-events
A=75.5% (70 err.), S+=30.5%, S-=94.5%, K=0.570
Tree 3
R1 If deg-malig > 2.5 and inv-nodes > 2.5 then recurrence-events
R2 else no-recurrence-events
A=76.2% (68 err.), S+=31.8%, S-=95.0%, K=0.554
Best rule in CV tests, best interpretation.
Conclusions
Heterogeneous systems are worth investigating. Good biological justification of the HAS approach.
Better learning cannot repair the wrong bias of a model. StatLog report: large differences between RBF and MLP on many datasets.
Networks, trees, kNN should select/optimize their functions. Radial and sigmoidal functions in NN are not the only choice.
Simple solutions may be discovered by HAS systems.
Open questions:
• How to train heterogeneous systems?
• How to find the optimal balance between complexity/flexibility? Ex. complexity of nodes vs. interactions (weights)?
• Hierarchical, modular networks: nodes that are networks themselves.
The End? Perhaps still the beginning ...