
Fast PCA and Bayesian Variable Selection for Large Data Sets Based on SQL and UDFs


Presentation Transcript


  1. Fast PCA and Bayesian Variable Selection for Large Data Sets Based on SQL and UDFs Mario Navas, Carlos Ordonez, Veerabhadran Baladandayuthapani KDD-LDMTA'10 July 25, 2010

  2. Introduction • Efficient computation over very large data sets for DM, ML, and statistics. • The majority of this work is done outside the DBMS. • External tools: • process flat files that are too big for RAM; • are limited by disk I/O; • pose security risks. • DBMSs are extensible through SQL and UDFs.

  3. Contributions • Extend the DBMS for PCA and Bayesian variable selection on large data. • Only one scan of the input table. • Maintain exact results for the generated models. • Exploit multi-threaded aggregate functions for data summarization. • Optimal use of hardware resources: memory and multiple processors.

  4. Preliminaries • Definitions • Input data X = {x1, x2, …, xn}, with d dimensions. • Y is the response or variable of interest. • Xa is the a-th dimension. • Data Summarization • Used to summarize the essential matrix products XXᵀ, XYᵀ, YYᵀ, and Y·1ₙ into n, L, and Q (sketched below).
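
For concreteness, a sketch of the standard definitions of these summaries, treating each xi as a d×1 column vector (consistent with the notation above):

  n = |X|, \qquad
  L = \sum_{i=1}^{n} x_i \quad (d \times 1), \qquad
  Q = X X^{\mathsf{T}} = \sum_{i=1}^{n} x_i x_i^{\mathsf{T}} \quad (d \times d)

Because every term is a sum over records, all three summaries can be accumulated in a single scan of X.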

  5. Preliminaries • Dimensionality reduction with PCA • Find a rotation U to project the input data into its principal components. • Bayesian Variable Selection • Stochastic search of explanatory variables that best predict the output Y, in the regression Y = β0 + β1X1 + … + βdXd. • Generate a sequence γ(1),…,γ(S) to estimate π(γ|Y,X), where Xa is part of the model Mγ if γa = 1.
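
As a compact statement of the PCA step (a sketch using the definitions above): if the correlation matrix ρ of X has the decomposition ρ = UΛUᵀ with eigenvalues in decreasing order, each point is projected as

  x_i' = U_k^{\mathsf{T}} x_i

where U_k holds the k leading columns of U, reducing each point from d to k dimensions.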

  6. Algorithms • Two-phase algorithms: • Summarization • Model computation • Large-scale processing with SQL and UDFs. • Hardware optimizations are considered.

  7. Efficient Computation of PCA • The covariance and correlation matrices are computed from n, L, and Q. • The SVD of the correlation (or covariance) matrix yields the principal components.
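
Both matrices follow from the summaries in closed form; the sketch below uses the standard one-pass estimators, with n, L, and Q as defined earlier:

  V = \frac{1}{n} Q - \frac{1}{n^2} L L^{\mathsf{T}}, \qquad
  \rho_{ab} = \frac{n\,Q_{ab} - L_a L_b}{\sqrt{n\,Q_{aa} - L_a^{2}}\;\sqrt{n\,Q_{bb} - L_b^{2}}}

Since V and ρ are d×d, the cost of the SVD step is independent of n.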

  8. Efficient Bayesian Variable Selection • Input table: (X1,…,Xd,Y) • Find Mγ, where Xγ = {Xa : γa = 1}. • Use Zellner's g-prior to estimate π(γ|Y,X).

  9. Efficient Bayesian Variable Selection • Selection probabilities are calculated from n, L, and Q.
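
For reference, one standard form of the marginal likelihood under a g-prior (a sketch from the Bayesian variable selection literature; the paper's exact expression may differ):

  \pi(Y \mid \gamma) \propto (1+g)^{(n-1-|\gamma|)/2}\,\bigl[\,1 + g\,(1 - R_\gamma^{2})\,\bigr]^{-(n-1)/2}

Here |γ| is the number of selected variables and R²γ is the coefficient of determination of regressing Y on Xγ. Because R²γ depends only on XγXγᵀ, XγYᵀ, YYᵀ, and n, every factor can be read off the γ-indexed submatrices of the summaries, with no further scan of X.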

  10. SQL Optimizations • Data summarization is done with distributive aggregate functions. • The result is a table storing the terms of n, L, and Q (see the sketch below).
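
A minimal sketch of such a summarization query in plain SQL, assuming a hypothetical input table X(x1, x2, y) with d = 2 (the paper generates the equivalent expressions for arbitrary d):

  SELECT COUNT(*)     AS n,    -- n: number of records
         SUM(x1)      AS l1,   -- L: one linear sum per dimension
         SUM(x2)      AS l2,
         SUM(x1 * x1) AS q11,  -- Q: squares and cross-products
         SUM(x1 * x2) AS q12,
         SUM(x2 * x2) AS q22,
         SUM(x1 * y)  AS xy1,  -- XYᵀ terms, used by variable selection
         SUM(x2 * y)  AS xy2,
         SUM(y)       AS ly,   -- Y·1ₙ
         SUM(y * y)   AS qyy   -- YYᵀ
  FROM X;

COUNT and SUM are distributive, so the DBMS can evaluate the whole query in one scan and merge partial aggregates computed in parallel.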

  11. SQL Optimizations • UDFs avoid column-length limitations: records are packed into a user-defined type (UDT). • Table-valued functions (TVFs) and stored procedures (SPs) return tables with the results for PCA and Bayesian variable selection (calling pattern sketched below).
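
To illustrate the calling pattern only, a SQL Server-style sketch; every object name below (SummaryType, PackRecord, SummarizeNLQ, PCAFromSummary) is hypothetical, not the paper's published API:

  -- Hypothetical names throughout: this only illustrates the
  -- pack-then-aggregate-then-TVF pattern described above.
  DECLARE @s dbo.SummaryType;   -- UDT holding n, L, Q as one value
  SELECT @s = dbo.SummarizeNLQ(dbo.PackRecord(x1, x2, y)) FROM X;
  SELECT * FROM dbo.PCAFromSummary(@s);  -- TVF returns components as rows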

  12. Hardware Optimizations • Multi-threading within TVFs and SPs, following the UDF API of the DBMS. • Blocks of records are cached in main memory. • Worker threads compute the aggregations. • The number of threads, the memory used for caching, and the workload queued for each thread are configurable.

  13. Evaluation of Optimizations • Our optimizations show linear scalability for summarization.

  14. Comparison with R • Execution-time comparison of PCA inside the DBMS against the statistical package R working on flat files. • Our implementation still produces solutions in cases where R hits its data-size limitations.

  15. Comparison with R • Our optimizations for stochastic search variable selection (SSVS) inside the DBMS prove efficient for processing large data. • Experiments were run with a time limit of 2,000 seconds; some could not be executed due to data-size limitations.

  16. Conclusions • We have extended DBMS functionality to include PCA and SSVS, using standard SQL and UDFs. • We have overcome the limitations of external tools for processing large data. • Model computation is not affected by the number of records; summarization scales linearly in n and d. • Results are exact; accuracy is not compromised.

  17. Future Work • Extend SSVS to high-dimensional data. • Explore alternatives to UDFs. • Further research on multi-threaded UDFs to solve a broader set of problems.

  18. Thank you!!
