Cognitive data analysis
Nikolay Zagoruiko
Institute of Mathematics of the Siberian Division of the Russian Academy of Sciences, Pr. Koptyg 4, 630090 Novosibirsk, Russia
zag@math.nsc.ru

Area of interests
Data analysis, pattern recognition, empirical prediction, discovery of regularities, data mining, machine learning, knowledge discovery, intelligent data analysis, cognitive computation.
Human-centered approach:
• The person as an object of study: the aim is to understand human cognitive mechanisms. New strategic tasks cannot be solved without a rapid rise in the intellectual level of the tools for observation, analysis and control.
• The person as a subject who uses the results of the analysis. The complexity of these tools and the character of their output make the results hard to understand. Under these conditions the person is effectively excluded from the man-machine control system.

Specificity of DM tasks:
• Large volumes of data
• Attributes of mixed types
• Number of attributes >> number of objects
• Presence of noise and missing values
• No information about distributions and dependences

The abundance of methods results from the absence of a uniform approach to solving tasks of different types.
What can we learn from the person?

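What decision rules does a person use? 1967. Recognition.
[Figure: recognition example; surviving labels 12, 1, 11.]

What decision rules does a person use? 1967. Taxonomy.
[Figure: taxonomy example; surviving labels 12, 1, 11.]
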
1. A person understands the results if the classes are separated by perpendicular planes.
[Figure: three scatter plots in axes x, y; an oblique class boundary X = 0.8Y - 3; rotated axes X', Y'.]

2. A person understands the results if the classes are described by standards.
[Figure: three scatter plots in axes x, y with standards marked by *; rotated axes X', Y'.]
The unique human ability to recognize patterns that are hard to tell apart rests on the ability to choose informative features.

Does a person switch from one basis to another when solving different classification tasks?
Most likely, people use some universal psycho-physiological function.
Our hypothesis: the basic function a person uses in classification, recognition, feature selection, etc. is a measure of similarity.

Similarity is not an absolute but a relative category
Is object b close to a, or is it distant?
[Figure: three panels: objects a and b alone; then a, b and a third object c; then a, b and c in a different arrangement.]
We need to know the answer to the question: in competition with what?

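Function of Competitive (Rival) Similarity (FRiS)
For an object z at distance r1 from A and r2 from B, the similarity of z to A in competition with B is F = (r2 - r1)/(r2 + r1).
[Figure: object z between A and B at distances r1 and r2; plot of F against the position of z, from +1 at A to -1 at B.]
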
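A minimal sketch of the FRiS function, using the formula F = (R2 - R1)/(R2 + R1) given on the FRiS-Stolp slides. The name `fris`, the Euclidean distance, and the value returned when z, A and B all coincide are assumptions made for illustration; the slides do not fix a distance metric.

```python
import math


def fris(z, a, b):
    """Similarity of z to a in competition with b, in the range [-1, +1]."""
    r1 = math.dist(z, a)  # distance to the object being compared
    r2 = math.dist(z, b)  # distance to the competitor
    if r1 + r2 == 0.0:
        return 0.0  # assumption: all three points coincide
    return (r2 - r1) / (r2 + r1)


a, b = (0.0, 0.0), (4.0, 0.0)
print(fris((0.0, 0.0), a, b))  # z coincides with a: 1.0
print(fris((2.0, 0.0), a, b))  # z halfway between a and b: 0.0
print(fris((4.0, 0.0), a, b))  # z coincides with b: -1.0
```

The same object z gets a different score if the competitor b is moved, which is the sense in which similarity is relative rather than absolute.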
Compactness
All pattern recognition methods are based on the compactness hypothesis (Braverman E.M., 1962).
Patterns are compact if:
- the number of boundary points is small compared with the total number of points;
- the patterns are separated from each other by borders that are not too elaborate.

Compactness
Similarity between objects of one pattern should be maximal.
Similarity between objects of different patterns should be minimal.

Compactness
Defensive capacity: compact patterns should satisfy the condition of maximal similarity between objects of the same pattern.

Compactness
Tolerance: compact patterns should satisfy the condition of maximal difference between these objects and the objects of other patterns.

Selection of the standards (stolps)
Algorithm FRiS-Stolp

Criteria
Fisher's informativeness criterion applies to normal distributions.
Compactness has the same meaning and can be used as a criterion of informativeness that is invariant to the distribution law and to the relation between N and M.

Selection of features
[Diagram: the initial set of features Xo = {1, 2, 3, ..., j, ..., N} goes to the search engine GRAD, which produces a variant of the subset X = <1, 2, ..., n>; the FRiS-compactness criterion rates each variant as bad or good.]

Algorithm GRAD
GRAD is based on a combination of two greedy algorithms: forward and backward search.
At the forward stage the algorithm Addition is used (J.L. Barabash, 1963).
At the backward stage the algorithm Deletion is used (Merill T. and Green O.M., 1963).

GRAD: algorithm AdDel
To reduce the influence of accumulated errors, a relaxation method is applied:
n1 is the number of most informative attributes added to the subsystem (Add);
n2 < n1 is the number of least informative attributes removed from the subsystem (Del).
Relaxation method: n steps forward, n/2 steps back.
[Figure: algorithm AdDel, reliability (R) of recognition at different dimensions of the feature space.]
R(AdDel) > R(DelAd) > R(Ad) > R(Del)

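The Add/Del alternation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the `quality` callback (standing in for a criterion such as FRiS-compactness), the stopping rule (`target_size`) and the toy scoring function are all hypothetical.

```python
def drop_one(selected, quality):
    """Remove the feature whose deletion leaves the best-scoring subset."""
    victim = max(selected, key=lambda f: quality([g for g in selected if g != f]))
    selected.remove(victim)


def ad_del(n_features, quality, target_size, n1=2, n2=1):
    """Alternate n1 greedy additions with n2 greedy deletions (n2 < n1)."""
    if not 0 <= n2 < n1:
        raise ValueError("need 0 <= n2 < n1")
    if not 0 < target_size <= n_features:
        raise ValueError("need 0 < target_size <= n_features")
    selected = []
    while True:
        # Add stage: n1 times, append the feature that improves quality most.
        for _ in range(n1):
            candidates = [f for f in range(n_features) if f not in selected]
            if not candidates:
                break
            selected.append(max(candidates, key=lambda f: quality(selected + [f])))
        # Assumed stopping rule: trim to the requested size and finish.
        if len(selected) >= target_size:
            while len(selected) > target_size:
                drop_one(selected, quality)
            return selected
        # Del stage: n2 times, drop the least useful feature.
        for _ in range(n2):
            drop_one(selected, quality)


def toy_quality(subset):
    """Hypothetical criterion: feature 1 is best alone, 0 and 3 are best together."""
    s = set(subset)
    both = 0 in s and 3 in s
    score = 5 * (0 in s) + 5 * (3 in s) + (5 if both else 0)
    if 1 in s:
        score += 1 if both else 6  # feature 1 is nearly redundant next to {0, 3}
    return score


print(ad_del(5, toy_quality, target_size=2, n1=2, n2=0))  # additions only: [1, 0]
print(ad_del(5, toy_quality, target_size=2, n1=3, n2=1))  # add 3, drop 1: [0, 3]
```

In the toy run, pure addition keeps the individually strong feature 1 (score 11), while adding three features and deleting one recovers the better pair {0, 3} (score 15). This is the kind of early greedy error the relaxation step is meant to correct.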
GRAD: algorithm GRAD
• AdDel can work with groups of attributes (granules) of different capacity m = 1, 2, 3, ... The granules can be formed by exhaustive search.
• But: the problem of combinatorial explosion!
• Solution: rely on the individual informativeness of attributes. This makes it possible to granulate only the most informative part of the attributes.
[Figure: frequency f with which an attribute falls into an informative subsystem, as a function of its rank L by individual informativeness.]

GRAD: algorithm GRAD (GRanulated AdDel)
1. Test the N attributes independently; select the first m1 << N best (m1 granules of power 1).
2. Form combinations; select the first m2 best, with m2 much smaller than the number of combinations (m2 granules of power 2).
3. Form combinations; select the first m3 best, with m3 much smaller than the number of combinations (m3 granules of power 3).
M = <m1, m2, m3> is the set of secondary attributes (granules).
AdDel selects the m* << |M| best granules, which include n* << N attributes.

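The three granulation steps can be sketched as below. Two points are assumptions rather than statements from the slides: pairs and triples are formed only from the m1 individually best attributes, and `quality` is a hypothetical stand-in for the informativeness criterion. The names `build_granules` and `toy_quality` are illustrative.

```python
from itertools import combinations


def build_granules(n_features, quality, m1, m2, m3):
    """Return the m1 best single attributes, m2 best pairs and m3 best triples."""
    # Step 1: rank attributes individually and keep the m1 best.
    singles = sorted(range(n_features), key=lambda f: quality((f,)), reverse=True)[:m1]
    pool = sorted(singles)
    # Steps 2 and 3: exhaustive search, but only inside the reduced pool.
    pairs = sorted(combinations(pool, 2), key=quality, reverse=True)[:m2]
    triples = sorted(combinations(pool, 3), key=quality, reverse=True)[:m3]
    return [(f,) for f in singles] + pairs + triples


WEIGHTS = [5, 6, 0, 5, 0, 1]  # hypothetical individual informativeness


def toy_quality(granule):
    """Hypothetical criterion: sum of weights plus a bonus when 0 and 3 meet."""
    bonus = 5 if {0, 3} <= set(granule) else 0
    return sum(WEIGHTS[f] for f in granule) + bonus


granules = build_granules(6, toy_quality, m1=3, m2=2, m3=1)
print(granules)  # [(1,), (0,), (3,), (0, 3), (0, 1), (0, 1, 3)]
```

The resulting list plays the role of the set M of secondary attributes, over which a search such as AdDel can then run.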
Criteria
Comparison of the criteria (CV and FRiS)
[Figure: order of attributes by informativeness for the two criteria, with a noise axis; annotated values C = 0.661 and C = 0.883.]
N = 100, M = 2*100, mt = 2*35, mC = 2*65, plus noise

Some real tasks
(K: number of classes, M: number of objects, N: number of attributes)

Task                                                    K    M        N
Medicine:
  Diagnostics of type II diabetes                       3    43       5520
  Diagnostics of prostate cancer                        4    322      17153
  Recognition of type of leukemia                       2    38       7129
  Microarray data                                       2    1000     500000
  9 genetic tables                                      2    50-150   2000-12000
Physics:
  Complex analysis of spectra                           7    20-400   1024
Commerce:
  Forecasting of book sales (Data Mining Cup 2009)      -    4812     1862

Recognition of two types of leukemia: ALL and AML
N = 7129

                Total   ALL   AML
Training set    38      27    11
Control set     34      20    14

I. Guyon, J. Weston, S. Barnhill, V. Vapnik. Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning, 2002, 46(1-3), pp. 389-422.

Pentium, T = 3 hours
Pentium, T = 15 sec
Training set 38, test set 34

Zagoruiko N., Borisova I., Dyubanov V., Kutnenko O.
In the first 27 subspaces P = 34/34.

FRiS      Decision rules                          P
0.72656   537/1, 1833/1, 2641/2, 4049/2           34
0.71373   1454/1, 2641/1, 4049/1                  34
0.71208   2641/1, 3264/1, 4049/1                  34
0.71077   435/1, 2641/2, 4049/2, 6800/1           34
0.70993   2266/1, 2641/2, 4049/2                  34
0.70973   2266/1, 2641/2, 2724/1, 4049/2          34
0.70711   2266/1, 2641/2, 3264/1, 4049/2          34
0.70574   2641/2, 3264/1, 4049/2, 4446/1          34
0.70532   435/1, 2641/2, 2895/1, 4049/2           34
0.70243   2641/2, 2724/1, 3862/1, 4049/2          34

Name of gene      Weight
2641/1, 4049/1    33
2641/1            32

I. Guyon, J. Weston, S. Barnhill, V. Vapnik

Ng     Vsuc   Vext    Vmed   Tsuc   Text    Tmed   P
7129   0.95   0.01    0.42   0.85   -0.05   0.42   29
4096   0.82   -0.67   0.30   0.71   -0.77   0.34   24
2048   0.97   0.00    0.51   0.85   -0.21   0.41   29
1024   1.00   0.41    0.66   0.94   -0.02   0.47   32
512    0.97   0.20    0.79   0.88   0.01    0.51   30
256    1.00   0.59    0.79   0.94   0.07    0.62   32
128    1.00   0.56    0.80   0.97   -0.03   0.46   33
64     1.00   0.45    0.76   0.94   0.11    0.51   32
32     1.00   0.45    0.65   0.97   0.00    0.39   33
16     1.00   0.25    0.66   1.00   0.03    0.38   34
8      1.00   0.21    0.66   1.00   0.05    0.49   34
4      0.97   0.01    0.49   0.91   -0.08   0.45   31
2      0.97   -0.02   0.42   0.88   -0.23   0.44   30
1      0.92   -0.19   0.45   0.79   -0.27   0.23   27

Comparison with 10 methods
• Jeffery I., Higgins D., Culhane A. Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data. http://www.biomedcentral.com/1471-2105/7/359
• 9 tasks on microarray data.
• 10 methods of feature selection. Attributes are treated as independent; the first n (best) are selected.
• Criterion: minimum number of errors on cross-validation (CV), 10 times by 50%.
• Decision rules: Support Vector Machine (SVM), Between Group Analysis (BGA), Naive Bayes Classification (NBC), K-Nearest Neighbors (KNN).

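One reading of "10 times by 50%" is ten random splits with half of the objects used for training and half for testing, with the error averaged over the splits. The sketch below follows that reading. The per-class (stratified) split, the 1-nearest-neighbour rule standing in for the SVM, BGA, NBC and KNN rules named on the slide, and the toy data are all assumptions.

```python
import random


def one_nn_predict(train, x):
    """Label of the training object nearest to x (squared Euclidean distance)."""
    nearest = min(train, key=lambda item: sum((p - q) ** 2 for p, q in zip(item[0], x)))
    return nearest[1]


def repeated_half_split_error(data, repeats=10, seed=0):
    """Mean test error over `repeats` random 50/50 splits, split per class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in data:
        by_class.setdefault(y, []).append((x, y))
    total = 0.0
    for _ in range(repeats):
        train, test = [], []
        for items in by_class.values():
            items = items[:]  # copy so the original order is untouched
            rng.shuffle(items)
            half = len(items) // 2
            train += items[:half]
            test += items[half:]
        wrong = sum(one_nn_predict(train, x) != y for x, y in test)
        total += wrong / len(test)
    return total / repeats


# Hypothetical toy data: two well-separated classes on a line.
data = [((0.1 * i, 0.0), "A") for i in range(6)]
data += [((10.0 + 0.1 * i, 0.0), "B") for i in range(6)]
print(repeated_half_split_error(data))  # 0.0
```

For feature selection, a function like this would be evaluated for each candidate number n of top-ranked attributes, and the n giving the smallest error would be kept.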
Methods of selection

Method                                                             Result
Significance analysis of microarrays (SAM)                         42
Analysis of variance (ANOVA)                                       43
Empirical Bayes t-statistic                                        32
Template matching                                                  38
maxT                                                               37
Between group analysis (BGA)                                       43
Area under the receiver operating characteristic curve (ROC)      37
Welch t-statistic                                                  39
Fold change                                                        47
Rank products                                                      42
FRiS-GRAD                                                          12

Empirical Bayes t-statistic: for medium-sized sets of objects
Area under a ROC curve: for small noise and a large set
Rank products: for large noise and a small set

Results of comparison

Task       N0      m1/m2    max of 4   GRAD
ALL1       12625   95/33    100.0      100.0
ALL2       12625   24/101   78.2       80.8
ALL3       12625   65/35    59.1       73.8
ALL4       12625   26/67    82.1       83.9
Prostate   12625   50/53    90.2       93.1
Myeloma    12625   36/137   82.9       81.4
ALL/AML    7129    47/25    95.9       100.0
DLBCL      7129    58/19    94.3       93.5
Colon      2000    22/40    88.6       89.5
Average                     85.7       88.4

Open problems
• Censoring of the training set
• Recognition with a boundary
• Stolp + corridor (FRiS + LDR)
• Imputation
• Associations
• Unification of tasks of different types (UC+X)
• Optimization of algorithms
• Implementation of the software system (OTEX 2)
• Applications (medicine, genetics, ...)
• ...

Conclusion
The FRiS function:
1. Provides an effective measure of similarity, informativeness and compactness
2. Provides unification of methods
3. Provides high quality of decisions
Publications: http://math.nsc.ru/~wwwzag

Thank you! • Questions, please?
Stolp: decision rules
Choosing the standards (stolps)
• A stolp is an object that protects its own objects and does not attack the objects of other patterns.
Defensive capacity: the similarity of own objects to the stolp should be maximal (a minimum of missed targets).
Tolerance: the similarity of the objects of other patterns to the stolp should be minimal (a minimum of false alarms).

Stolp: algorithm FRiS-Stolp
Compact patterns should satisfy two conditions, both measured with F(j,i)|b = (R2 - R1)/(R2 + R1), where R1 is the distance from object j to stolp i and R2 is the distance from j to the competitor b:
Defensive capacity (security): maximal similarity of own objects to stolp i.
Tolerance: maximal difference of the objects of other patterns from stolp i.

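The choice of a single stolp can be sketched as follows. This illustrates the two conditions rather than the full FRiS-Stolp algorithm. The Euclidean distance, the choice of competitor in each term, and the equal-weight average of defensive capacity and tolerance are assumptions, and the names `first_stolp` and `fris_from_distances` are hypothetical.

```python
import math


def fris_from_distances(r1, r2):
    """F = (R2 - R1) / (R2 + R1); 0.0 if both distances are zero (assumption)."""
    return (r2 - r1) / (r2 + r1) if r1 + r2 > 0 else 0.0


def mean(values):
    return sum(values) / len(values)


def first_stolp(own, others):
    """Pick the object of `own` that best defends its pattern and spares the rest.

    Assumes at least two objects in `own` and at least two in `others`.
    """
    best, best_score = None, -math.inf
    for i, cand in enumerate(own):
        # Defensive capacity: own objects should be closer to the candidate
        # than to their nearest object of another pattern.
        defence = mean([
            fris_from_distances(math.dist(x, cand), min(math.dist(x, b) for b in others))
            for j, x in enumerate(own) if j != i
        ])
        # Tolerance: foreign objects should be closer to their own nearest
        # neighbour than to the candidate.
        tolerance = mean([
            fris_from_distances(
                min(math.dist(b, o) for k, o in enumerate(others) if k != q),
                math.dist(b, cand),
            )
            for q, b in enumerate(others)
        ])
        score = (defence + tolerance) / 2  # assumed equal weighting
        if score > best_score:
            best, best_score = cand, score
    return best


own = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
others = [(10.0, 0.0), (11.0, 0.0)]
print(first_stolp(own, others))  # (1.0, 0.0)
```

In this toy example the central object wins: it is close to both of its neighbours, while the object nearest to the other pattern scores lower on tolerance.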