
Parameter Optimized Vertical, Nearest Neighbor-Vote and Boundary Based Classification

This paper discusses the use of parameter-optimized vertical classification for computer-aided detection of pulmonary embolism. The approach combines attribute selection, Gaussian nearest-neighbor voting, and local class-boundary based classification. Experimental results show high accuracy in identifying sick patients, with a focus on high negative predictive value.


Presentation Transcript


  1. Parameter Optimized Vertical, Nearest Neighbor-Vote and Boundary Based Classification Amal Perera, William Perrizo {amal.perera, william.perrizo}@ndsu.edu Dept. of CS, North Dakota State University. CATA 2007 – Honolulu, Hawaii

  2. Outline • Introduction • Background • Our Approach • Experimental Results • Conclusions Parameter Optimized Vertical Classification

  3. Introduction • Computer Aided Detection (CAD): • Interesting data mining applications • Typical Medical Image Data Sets are • Large • Extremely unbalanced between + & - classes • Large number of “irrelevant” features • Noisy labels based on human decisions that take only a few features into consideration • due to human working-memory limits: ~5 ± 2 contexts. • Major Requirement: • Extremely high performance thresholds for clinical acceptance. Parameter Optimized Vertical Classification

  4. Introduction (Cont.) • Pulmonary Embolism (PE): 650,000 cases per year in the US (the root cause can be anything that stresses the cardiovascular system). • A condition that occurs when • thromboses (blood clots), usually from the legs, • move through the ever-enlarging venous system, to and through the heart, • into the ever-narrowing pulmonary arterial system, • where they lodge and block lung arteries. • Highly lethal condition • symptoms are usually detected in an emergency room setting • diagnosis of true positives has to be followed by swift treatment • treatment usually involves a blood thinner (e.g., warfarin) • false positives can be very bad • symptoms can resemble a brain aneurysm • and for aneurysms, one wants thick blood, not thin! • The Holy Grail of PE CAD is fast detection of true negatives (high Negative Predictive Value (NPV)). Parameter Optimized Vertical Classification

  5. Introduction (Cont.) • Classification attributes are automatically generated from a large number of Computed Tomography Angiography (CTA) images. • Objective of a PE CAD system: • Identify the sick patients from the available descriptive features with high accuracy (especially NPV accuracy). • We applied: • Parameter Optimized Vertical, Nearest Neighbor-Vote and Boundary Based Classification • The approach was successfully used in ACM 2006 KDD Cup data mining competition (won the NPV task with a score that was twice as high as the nearest competitor). Parameter Optimized Vertical Classification

  6. KDD 2006 PE Data • 67 CTA cases • 4,424 PE candidates • 116 features generated from Computed Tomography Angiography study images (1 image per slice of lung) Parameter Optimized Vertical Classification

  7. Background • Classification similarity and attribute relevance. • Classical classification techniques such as KNN work when all attributes are similar in relevance. • Classical KNN tends to suffer (in calculation speed) with a large number of machine-generated attributes (the curse of dimensionality). • Current approaches • Weighting the attributes based on “relevance” (derived using a heuristic algorithm). • Attribute selection (pruning or filtering) Parameter Optimized Vertical Classification

  8. Our Approach • Our Attribute Selection (AS) step was followed by a combination of Gaussian Nearest Neighbor (GNN) and Local Class Boundary (LCB) based classification. • Classification parameters were optimized with a Genetic Algorithm. • The training set was structured vertically into Predicate-trees or P-trees (losslessly compressed, data-mining-ready vertical structures), on which • attribute relevance analysis was done, • nearest neighbor sets were created, • class boundary analysis was done. • With compressed P-trees, processing can be done in compressed form (no need to uncompress, process and then compress again). Parameter Optimized Vertical Classification

  9. P-tree* Vertical Data Structure • Predicate-trees (P-trees) • Lossless, compressed, data-mining-ready • Successfully used in KNN, ARM, Bayesian classification, SVM... • A basic P-tree represents • one attribute bit-slice reorganized into a compressed tree, • built by recursively sub-dividing, while recording the predicate truth value regarding purity of 1-bits for each division. • Each level of the tree contains truth-bits that represent purely 1-bit sub-trees. • Construction continues recursively down each tree path until a pure sub-division is reached (either pure 1-bits or pure 0-bits). * Predicate Tree (P-tree) technology is patented by North Dakota State University (William Perrizo, primary inventor of record); patent number 6,941,303 issued September 6, 2005. Parameter Optimized Vertical Classification

  10. What are P-trees? • Traditionally, data is structured horizontally (into records) and processed vertically (through scans); P-trees are vertical structures (bit slices) processed horizontally (with ANDs). • A relation R(A1, A2, A3, A4) with 3-bit attributes is decomposed into twelve bit slices R11..R43; each slice is compressed into a basic P-tree (P11..P43) by recording, top-down on recursive halves, the truth of the predicate “this half is pure 1-bits,” continuing until purity (all 1s or all 0s) is reached. • Example: the bit slice R11 = 0 0 0 0 1 0 1 1 gets root 0 (the whole slice is not pure-1); its first half (0 0 0 0) is pure-0, so that branch ends; its second half is subdivided further. • Occurrences of a value, e.g. 111 000 001 100, are counted by ANDing the corresponding P-trees and complements: P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43. [Figure: worked example showing the twelve bit slices of R and the resulting basic P-trees P11..P43.] Parameter Optimized Vertical Classification
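The construction described above can be sketched in a few lines. The following is a minimal illustrative sketch only (not the patented NDSU implementation): it builds a basic P-tree from one bit slice by recursive halving, recording purity, and recovers the 1-bit count from the compressed form. Function names are hypothetical.

```python
# Minimal P-tree sketch (illustrative only, not the patented NDSU implementation).
# A pure segment compresses to a single bit; mixed segments split recursively.

def build_ptree(bits):
    """Return 1 or 0 for a pure segment, else ('mixed', left, right)."""
    if all(b == 1 for b in bits):
        return 1
    if all(b == 0 for b in bits):
        return 0
    mid = len(bits) // 2
    return ('mixed', build_ptree(bits[:mid]), build_ptree(bits[mid:]))

def root_count(tree, length):
    """Count the 1-bits represented by a (possibly compressed) P-tree."""
    if tree == 1:
        return length
    if tree == 0:
        return 0
    _, left, right = tree
    half = length // 2
    return root_count(left, half) + root_count(right, length - half)

# The bit slice R11 from the slide: 0 0 0 0 1 0 1 1
r11 = [0, 0, 0, 0, 1, 0, 1, 1]
p11 = build_ptree(r11)
print(root_count(p11, len(r11)))  # prints 3
```

Note how the pure-0 first half collapses to a single node, which is the source of the compression: counts are then read off the tree without decompressing.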

  11. Attribute Relevance • P-tree processing speed allowed for multiple rounds of attribute relevance analysis, including: • Information gain based rounds, • statistics based rounds, • heuristic rounds. Parameter Optimized Vertical Classification

  12. Gaussian Near Neighbor Vote based Classification (GNN) • Each near neighbor casts a Gaussian-weighted vote (the Gaussian of the distance from the unclassified candidate, s, to the training neighbor candidate, xc, whose class label is c): Vote(xc) = e^(-(sigma · d(xc, s))^2) × ClassPercent_c × VoteFactor_c • All near neighbor sets were closed. Parameter Optimized Vertical Classification
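A minimal sketch of the Gaussian near-neighbor vote, assuming Euclidean distance and per-class ClassPercent / VoteFactor dictionaries (the parameter names come from the slide; the defaults of 1.0 and the function name are assumptions):

```python
import math

def gaussian_vote(s, neighbors, sigma=0.5, class_percent=None, vote_factor=None):
    """Weighted vote over a closed near-neighbor set.
    neighbors: list of (feature_vector, class_label) pairs.
    Each neighbor contributes e^(-(sigma*d)^2) * ClassPercent_c * VoteFactor_c."""
    class_percent = class_percent or {}
    vote_factor = vote_factor or {}
    votes = {}
    for x, c in neighbors:
        d = math.dist(s, x)               # Euclidean distance (an assumption)
        w = math.exp(-(sigma * d) ** 2)   # Gaussian weight falls off with distance
        w *= class_percent.get(c, 1.0) * vote_factor.get(c, 1.0)
        votes[c] = votes.get(c, 0.0) + w
    return max(votes, key=votes.get)      # class with the largest total vote

# Two nearby class-0 neighbors outvote one distant class-1 neighbor:
print(gaussian_vote((0.05, 0.0),
                    [((0.0, 0.0), 0), ((0.1, 0.0), 0), ((5.0, 5.0), 1)]))  # prints 0
```

The VoteFactor knob is what the genetic algorithm can tune to trade false negatives against false positives, which matters for the NPV objective.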

  13. Local Boundary based Classification (LBC) • Since any smooth class boundary can be piecewise-linearly approximated, • in a small neighborhood the class boundary can be assumed linear. • An inner product analysis step was used to determine which side of the local boundary line contains the unclassified candidate. [Figure: negative-class (0) median, sample midpoint, and positive-class (1) median along the local boundary normal.] Parameter Optimized Vertical Classification
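A minimal sketch of the median / midpoint / inner-product step, under the assumption that the local boundary is the hyperplane through the midpoint of the two class medians, normal to the line joining them (the function name is hypothetical):

```python
import statistics

def boundary_side(s, neg_pts, pos_pts):
    """Return 1 if s falls on the positive-class side of the local
    linear boundary, else 0."""
    m0 = [statistics.median(c) for c in zip(*neg_pts)]  # class-0 median, per dimension
    m1 = [statistics.median(c) for c in zip(*pos_pts)]  # class-1 median, per dimension
    mid = [(a + b) / 2 for a, b in zip(m0, m1)]         # sample midpoint
    normal = [b - a for a, b in zip(m0, m1)]            # local boundary normal
    # Inner product of (s - midpoint) with the normal decides the side.
    proj = sum((si - mi) * ni for si, mi, ni in zip(s, mid, normal))
    return 1 if proj > 0 else 0

neg = [(0, 0), (1, 0), (0, 1)]
pos = [(4, 4), (5, 4), (4, 5)]
print(boundary_side((1, 1), neg, pos), boundary_side((3, 3), neg, pos))  # prints 0 1
```

Medians rather than means keep the boundary estimate robust to the noisy labels mentioned in the introduction.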

  14. Genetic Algorithms • GAs were used to improve parameter choices (e.g., attribute weights). Fitness was based on the task scoring criteria. • Convergence relied heavily upon the vertical (attribute relevance) and horizontal (near neighborhood) pruning of the training set. Parameter Optimized Vertical Classification
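A minimal GA sketch for optimizing real-valued parameters such as attribute weights. The operators (truncation selection, uniform crossover, Gaussian mutation) and all rates are assumptions, and the toy fitness function stands in for the real task scoring criteria:

```python
import random

def evolve(fitness, n_params, pop_size=20, generations=30, mutation_rate=0.1, seed=0):
    """Evolve a vector of parameters in [0, 1] that maximizes fitness."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_params)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]                  # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [x if rng.random() < 0.5 else y    # uniform crossover
                     for x, y in zip(a, b)]
            child = [min(1.0, max(0.0, g + rng.gauss(0, 0.1)))  # Gaussian mutation
                     if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Toy fitness: prefer weights close to [1, 0, 1] (a stand-in for the real scoring)
target = [1.0, 0.0, 1.0]
fit = lambda w: -sum((g - t) ** 2 for g, t in zip(w, target))
best = evolve(fit, 3)
```

In the paper's setting the fitness call would run the full classifier on a validation split, which is why the P-tree pruning that makes each evaluation fast is essential for convergence.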

  15. Method Overview [Flow diagram: Horizontal Training Data → Vertical Training Data (P-trees) → Attribute Relevance Analysis → Relevant Attributes → Gaussian Near Neighbor + Local Class Boundary → Combination Classifier → Optimized Classifier → Test Data → Final Results; a Genetic Algorithm loop feeds parameter fitness back into the classifier.] Parameter Optimized Vertical Classification

  16. Experimental Results • Objective: quality of classification • Datasets • KDD-Cup-2006 PE data set primarily • also used the Iris Plant data set • Quality measure • Negative Predictive Value (NPV) • NPV = TN / (TN + FN) • where TN: True Negatives, FN: False Negatives Parameter Optimized Vertical Classification
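The NPV definition above as a one-line helper (the function name and the zero-denominator handling are assumptions):

```python
def npv(tn, fn):
    """Negative Predictive Value: the fraction of predicted negatives
    that are truly negative."""
    return tn / (tn + fn) if (tn + fn) else 0.0

# E.g., 90 true negatives and 10 false negatives:
print(npv(90, 10))  # prints 0.9
```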

  17. Results: Quality • KDD 2006 Data Set • For the unknown test data set (class labels of the test data set were hidden during the competition): • TN = 1.0 • NPV = 1.0 • Best submission - 2006 KDD Cup • Iris Data Set • Linearly separable class: • TN = 1.0 NPV = 1.0 • Linearly non-separable class: • TN = 0.98 NPV = 1.0 Parameter Optimized Vertical Classification

  18. Conclusions • We present a successful PE CAD classifier that addresses: • Large number of irrelevant attributes. • Extremely unbalanced + / - data • Extremely high performance threshold to achieve clinical acceptance • Our CAD classifier uses: • Weighted Nearest Neighbor classification • Inner Product / Median Class Boundary based classification. • Genetic Algorithm parameter optimization. • P-tree data structure for efficient computation. Parameter Optimized Vertical Classification

  19. Thank you Parameter Optimized Vertical Classification
