Classifiers in Atlas CS240B Class Notes UCLA
Data Mining • Classifiers: • Bayesian classifiers • Decision trees • The Apriori Algorithm • DBSCAN Clustering: http://wis.cs.ucla.edu/atlas/examples.html
The Classification Task • Input: a training set of tuples, each labelled with one class label • Output: a model (classifier) which assigns a class label to each tuple based on the other attributes. • The model can be used to predict the class of new tuples, for which the class label is missing or unknown • Some natural applications • credit approval • medical diagnosis • treatment effectiveness analysis
Train & Test • The tuples (observations, samples) are partitioned into a training set and a test set. • Classification is performed in two steps: • training: build the model from the training set • testing: evaluate the model on the test set (for accuracy, etc.)
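The two steps above can be sketched in Python as follows. This is only an illustration: the function names, the 30% test fraction, and the majority-class baseline model are assumptions made for the example and do not come from the slides.

```python
import random

def train_test_split(tuples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the tuples and cut it into (training, test)."""
    shuffled = list(tuples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def train_majority(training_set):
    """Hypothetical baseline model: always predict the most frequent class."""
    labels = [label for _, label in training_set]
    # sorted() makes tie-breaking deterministic.
    majority = max(sorted(set(labels)), key=labels.count)
    return lambda attrs: majority

def accuracy(classifier, test_set):
    """Fraction of test tuples whose predicted label equals the true label."""
    if not test_set:
        raise ValueError("empty test set")
    correct = sum(1 for attrs, label in test_set if classifier(attrs) == label)
    return correct / len(test_set)
```

Each tuple is represented here as an `(attributes, label)` pair, so any function from attributes to a label can be passed to `accuracy` as the classifier.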
Classical example: play tennis? • Training set from Quinlan's book • Seq could have been used to generate the RID column
Bayesian classification • The classification problem may be formalized using a-posteriori probabilities: • P(C|X) = prob. that the sample tuple X=<x1,…,xk> is of class C. • E.g. P(class=N | outlook=sunny,windy=true,…) • Idea: assign to sample X the class label C such that P(C|X) is maximal
Estimating a-posteriori probabilities • Bayes theorem: P(C|X) = P(X|C)·P(C) / P(X) • P(X) is constant for all classes • P(C) = relative freq of class C samples • C such that P(C|X) is maximum = C such that P(X|C)·P(C) is maximum
Naïve Bayesian Classification • Naïve assumption: attribute independence P(x1,…,xk|C) = P(x1|C)·…·P(xk|C) • For Categorical attributes: P(xi|C) is estimated as the relative freq of samples having value xi as i-th attribute in class C • Computationally this is a count with grouping
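The "count with grouping" view of naïve Bayes can be sketched in Python. The function names and the six training tuples below are assumptions for illustration (the rows are not the slides' table), and no smoothing is applied, so a value never seen with a class drives that class's score to zero.

```python
from collections import Counter

# Illustrative tuples only (assumed): (attributes, class label).
TRAINING = [
    ({"outlook": "sunny", "windy": "false"}, "N"),
    ({"outlook": "sunny", "windy": "true"}, "N"),
    ({"outlook": "overcast", "windy": "false"}, "P"),
    ({"outlook": "rain", "windy": "false"}, "P"),
    ({"outlook": "rain", "windy": "true"}, "N"),
    ({"outlook": "overcast", "windy": "true"}, "P"),
]

def train_naive_bayes(training_set):
    """Count class frequencies and (attribute, value, class) frequencies."""
    class_counts = Counter()
    value_counts = Counter()
    for attrs, label in training_set:
        class_counts[label] += 1
        for name, value in attrs.items():
            value_counts[(name, value, label)] += 1
    return class_counts, value_counts

def classify(model, attrs):
    """Return the class C that maximizes P(C) * P(x1|C) * ... * P(xk|C)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())

    def score(label):
        s = class_counts[label] / total  # P(C) as a relative frequency
        for name, value in attrs.items():
            # P(xi|C) as a relative frequency within class C
            s *= value_counts[(name, value, label)] / class_counts[label]
        return s

    return max(sorted(class_counts), key=score)
```

P(X) is never computed, since it is the same for every class and does not affect which class attains the maximum.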
Bayesian Classifiers • The training can be done by SQL count and grouping sets (but that might require many passes through the data). If the results are stored in a table called SUMMARY, then: • The testing is a simple SQL query on SUMMARY • First operation is to verticalize the table
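One possible reading of "verticalize, then count" is sketched below with Python's built-in `sqlite3` module. The slides do not show the actual ATLaS or SQL code, so the table names (`train`, `vertical`, `summary`), the column names, and the sample rows are all hypothetical.

```python
import sqlite3

# Illustrative rows only (assumed, not the slides' table): RID, outlook, windy, class.
ROWS = [
    (1, "sunny", "false", "N"),
    (2, "sunny", "true", "N"),
    (3, "overcast", "false", "P"),
    (4, "rain", "false", "P"),
    (5, "rain", "true", "N"),
    (6, "overcast", "true", "P"),
]

def build_summary(rows):
    """Verticalize the training table, then count per (column, value, class)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE train(rid INTEGER, outlook TEXT, windy TEXT, cls TEXT)")
    db.executemany("INSERT INTO train VALUES (?, ?, ?, ?)", rows)
    # Verticalization: one (rid, col, val, cls) row per attribute of each tuple.
    db.execute("""
        CREATE TABLE vertical AS
        SELECT rid, 'outlook' AS col, outlook AS val, cls FROM train
        UNION ALL
        SELECT rid, 'windy' AS col, windy AS val, cls FROM train
    """)
    # A single GROUP BY over the vertical table covers every attribute at once.
    db.execute("""
        CREATE TABLE summary AS
        SELECT col, val, cls, COUNT(*) AS cnt
        FROM vertical GROUP BY col, val, cls
    """)
    return db

def summary_count(db, col, val, cls):
    """Testing side: look up one count in summary (0 if the combination is absent)."""
    row = db.execute(
        "SELECT cnt FROM summary WHERE col = ? AND val = ? AND cls = ?",
        (col, val, cls),
    ).fetchone()
    return row[0] if row else 0
```

Once the data is in vertical form, the per-attribute counts no longer need one grouping query per column, which is the motivation for verticalizing first.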
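Decision tree obtained with ID3 (Quinlan 86) • [Figure: tree rooted at outlook with branches sunny, overcast, rain; sunny leads to a humidity test (high, normal), overcast leads to leaf P, rain leads to a windy test (weak, strong); leaves are labelled P or N; nodes are numbered [0], [1], [2], [3], […], [4]]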
Decision Tree Classifiers • Computed in a recursive fashion • Various ways to split and to compute the splitting function • First operation is to verticalize the table
Initial state: the node column • Training set from Quinlan's book
Gini index • E.g., two classes, Pos and Neg, and a dataset S with p Pos-elements and n Neg-elements. • f_p = p/(p+n), f_n = n/(p+n), gini(S) = 1 − f_p² − f_n² • If dataset S is split into S1, S2, S3, then ginisplit(S1,S2,S3) = gini(S1)·(p1+n1)/(p+n) + gini(S2)·(p2+n2)/(p+n) + gini(S3)·(p3+n3)/(p+n) • These computations can be easily expressed in ATLaS
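The Gini formulas can be checked with a short Python sketch. The function names are assumptions for illustration, and the code is plain Python, not ATLaS; each part of a split is given as a pair of counts (p_i, n_i).

```python
def gini(p, n):
    """gini(S) = 1 - f_p^2 - f_n^2 for a dataset with p Pos and n Neg elements."""
    total = p + n
    if total == 0:
        return 0.0  # convention for an empty partition
    fp, fn = p / total, n / total
    return 1.0 - fp * fp - fn * fn

def gini_split(parts):
    """Weighted Gini of a split; parts is a non-empty list of (p_i, n_i) pairs."""
    total = sum(p + n for p, n in parts)
    return sum(gini(p, n) * (p + n) / total for p, n in parts)
```

For example, with illustrative counts, `gini(9, 5)` is about 0.459, while `gini_split([(2, 3), (4, 0), (3, 2)])` is about 0.343, so this three-way split lowers the impurity. A splitting function would compare such values across candidate attributes and pick the smallest.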
Programming in ATLaS • Table-based programming is powerful and natural for data-intensive applications • SQL can be awkward, and many extensions are possible • But even SQL "as is" is adequate
The ATLaS System • The system compiles ATLaS programs into C programs, which execute on the Berkeley DB record manager • The 100-line Apriori program compiles into 2,800 lines of C • Other data structures (R-trees, in-memory tables) have been added using the same API. • The system is now 54,000 lines of C++ code.
ATLaS: Conclusions • A native extensibility mechanism for SQL, and a simple one. More efficient than Java or PL/SQL • Effective with Data Mining applications • Also OLAP applications, recursive queries, and temporal database applications • Complements current mechanisms based on UDFs and Data Blades • Supports and favors streaming aggregates (SQL implicit default is blocking) • Good basis for determining program properties: e.g. (non)monotonic and blocking behavior • These are lessons that future QLs cannot easily ignore.
The Future • Continuous queries on Data Streams • Other extensions and improvements • Stay tuned: www.wis.ucla.edu