
Today’s Annual Review Meeting

Scalable Methods for the Analysis of Network-Based Data. Annual Review Meeting. Principal Investigator: Professor Padhraic Smyth, Department of Computer Science, University of California, Irvine. Additional project information online at www.datalab.uci.edu/muri.


Presentation Transcript


  1. Scalable Methods for the Analysis of Network-Based Data • Annual Review Meeting • Principal Investigator: Professor Padhraic Smyth • Department of Computer Science • University of California, Irvine • Additional project information online at www.datalab.uci.edu/muri

  2. Today’s Annual Review Meeting • Goals • Review our research progress • Discussion, questions, interaction • Feedback from visitors • Format • Introduction • Research talks • 20 minute talks + 10 minutes/session for questions/discussion • Poster session • 1 to 2:30 (in this room) • Questions/discussion encouraged during talks • Several breaks

  3. MURI Project Timeline • Initial 3-year period • May 1 2008 to April 30th 2011 • Funding arrived at universities in Oct 2008 • 2-year extension: • May 1 2011 to April 30th 2013 • Meetings (all at UC Irvine) • Kickoff Meeting, November 2008 • Working Meetings, April 2009, August 2009 • Annual Review, December 2009 • Working Meeting, May 2010 • Annual Review, November 2010 • Working Meeting, June 2011 • Annual Review, January 2012 • … plus many smaller meetings involving subsets of the research team

  4. Motivation 2007: interdisciplinary interest in analysis of large network data sets Many of the available techniques were descriptive, could not handle • Prediction • Missing data • Covariates, etc

  5. Motivation 2007: significant statistical body of theory available on network modeling Many of the available techniques did not scale up to large data sets, not widely known/understood/used, etc 2007: interdisciplinary interest in analysis of large network data sets Many of the available techniques were descriptive, could not handle • Prediction • Missing data • Covariates, etc

  6. Motivation 2007: significant statistical body of theory available on network modeling Many of the available techniques did not scale up to large data sets, not widely known/understood/used, etc 2007: interdisciplinary interest in analysis of large network data sets Many of the available techniques were descriptive, could not handle • Prediction • Missing data • Covariates, etc Goal of this MURI project Develop new statistical network models and algorithms to broaden their scope of application to large, complex, dynamic real-world network data sets

  7. Key Aspects of Our Technical Approach • Foundational statistical theory for network data • New methods to handle heterogeneous network data (with time, text, ..) • Efficient algorithms and data structures for scalable statistical estimation • Applications to large real-world data sets • Open-source software for others to build on

  8. Example: Network Dynamics in Classrooms Chris DuBois, Carter Butts, Padhraic Smyth, Dan McFarland (Stanford)

  9. Example: Email Communication Data Data: Count matrix of 200,000 email messages among 3000 individuals over 3 months Problem: Understand communication patterns and predict future communication activity Challenges: sparse data, missing data, non-stationarity, unseen covariates C. DuBois, J. Foulds, P. Smyth, ICWSM, 2011
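
The sparsity challenge on the email slide can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the toy message list are hypothetical, and it assumes each message has a single sender and a single recipient, which the slide does not state. It builds a count matrix as a dictionary keyed by (sender, recipient) and computes an upper bound on the fraction of ordered pairs that can be nonzero for 200,000 messages among 3000 individuals.

```python
from collections import Counter


def count_matrix(messages):
    """Map each (sender, recipient) pair to its message count.

    Storing only observed pairs keeps the representation sparse.
    """
    return Counter(messages)


def density_upper_bound(n_messages, n_people):
    """Largest possible fraction of ordered sender-recipient pairs that are nonzero.

    Assumes one recipient per message (an assumption, not from the slides).
    """
    n_pairs = n_people * (n_people - 1)
    return min(n_messages, n_pairs) / n_pairs


# Hypothetical toy data, not from the project.
toy = [("a", "b"), ("a", "b"), ("b", "c")]
counts = count_matrix(toy)

# Figures quoted on the slide: 200,000 messages, 3000 individuals.
bound = density_upper_bound(200_000, 3000)
print(f"At most {bound:.1%} of ordered pairs can be nonzero")
```

Under these assumptions, well over 97% of the cells in the count matrix must be zero, which is why a dictionary of observed pairs is a more natural container than a dense array.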

  10. Example: Time Evolution of Emergency Responder Organizational Network for Hurricane Katrina C. T. Butts, R. Acton, and C. Marcum, Interorganizational collaboration in the hurricane Katrina response, Journal of Social Structure, 2010

  11. MURI Team

  12. Collaboration Network (Circa 2007) David Eppstein Carter Butts Mike Goodrich Mark Handcock Dave Mount Padhraic Smyth Dave Hunter

  13. Collaboration Network Ragupathyraj Vallyvan Lowell Trott Maarten Loffler Chris Marcum Emma Spiro Pavel Pszona Zack Almquist Darren Strash Ryan Acton Sean Fitzhugh Joe Simons Lorien Jasny David Eppstein Carter Butts Krista Gile Mike Goodrich Mark Handcock Ranran Wang Ian Fellows Miruna Petrescu-Prahova Dave Mount Padhraic Smyth Dave Hunter Eunhui Park Minkyoung Cho Jimmy Foulds Michael Schweinberger Romain Thibaux Pavel Krivitsky Chris DuBois Ruth Hummel Nick Navaroli Duy Vu Arthur Asuncion

  15. Collaboration Network University of Utrecht Ragupathyraj Vallyvan RAND Lowell Trott Maarten Loffler Chris Marcum Emma Spiro Pavel Pszona Zack Almquist U Mass Amherst Computational Social Science Initiative Darren Strash Ryan Acton Sean Fitzhugh Joe Simons Lorien Jasny Intel David Eppstein Carter Butts Krista Gile Mike Goodrich Mark Handcock Ranran Wang Ian Fellows Miruna Petrescu-Prahova Dave Mount Padhraic Smyth Dave Hunter Eunhui Park Minkyoung Cho Jimmy Foulds Michael Schweinberger Romain Thibaux Pavel Krivitsky Facebook Chris DuBois Ruth Hummel Nick Navaroli Duy Vu Arthur Asuncion Google

  16. Mapping the Project Terrain Domain Theory Data Collection Network Modeling

  17. Mapping the Project Terrain Domain Theory Data Collection Data Structures and Algorithms Network Modeling Statistical Theory Inference Algorithms

  18. Mapping the Project Terrain Domain Theory Data Collection Data Structures and Algorithms Network Modeling Statistical Theory Inference Algorithms Hypothesis Testing Prediction/ Forecasting Decision Support Simulation

  20. Statistical Network Modeling Approaches • Exponential Random Graph Models (ERGMs) • “Canonical” representation for statistical models of networks • Can model edge dependencies in very flexible ways • Fitting of the model can be computationally difficult • Latent Variable Models • Edges are conditionally independent given the latent variables • Can lead to much simpler estimation algorithms than regular ERGMs • Model interpretation can be difficult • Event-Based Models • Edges have time-stamps, models based on survival analysis • Surprisingly can be much easier to fit than models for “static” networks
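
To illustrate why fitting an ERGM can be computationally difficult, here is a minimal sketch, not the project's or Statnet's implementation. It assumes the standard ERGM form, in which the probability of a graph is proportional to exp(theta · g(y)), and uses edge and triangle counts as the statistics g(y); that choice of statistics and all function names are assumptions made for illustration. The normalizing constant is computed by brute-force enumeration of every graph on n nodes, which is feasible only for very small n.

```python
from itertools import combinations, product
from math import exp


def stats(edges, n):
    """Return (edge count, triangle count) for an undirected graph on n nodes."""
    edge_set = {frozenset(e) for e in edges}
    triangles = sum(
        1
        for a, b, c in combinations(range(n), 3)
        if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= edge_set
    )
    return len(edge_set), triangles


def ergm_probability(edges, n, theta):
    """Exact ERGM probability by enumerating all 2^(n choose 2) graphs.

    theta = (edge parameter, triangle parameter); illustrative statistics only.
    """
    dyads = list(combinations(range(n), 2))

    def weight(es):
        e, t = stats(es, n)
        return exp(theta[0] * e + theta[1] * t)

    # Normalizing constant: sum of weights over every possible graph.
    z = sum(
        weight([d for d, keep in zip(dyads, mask) if keep])
        for mask in product([0, 1], repeat=len(dyads))
    )
    return weight(edges) / z


# With theta = (0, 0) every graph is equally likely: 1/8 for n = 3.
p = ergm_probability([(0, 1)], 3, (0.0, 0.0))
```

The enumeration covers 2^(n(n-1)/2) graphs, so it already involves 64 graphs at n = 4 and more than 10^13 at n = 10. Practical fitting therefore has to avoid enumerating the normalizing constant, typically through approximation or sampling.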

  21. Impact: Software • R Language and Environment • Open-source, high-level environment for statistical computing • Default standard among research statisticians - increasingly being adopted by others • Estimated 250k to 1 million users • Statnet • R libraries for analysis of network data • New contributions from this MURI project: • Missing data (Gile and Handcock, 2010) • Relational event models (Butts, 2008-2011) • Latent-class models (DuBois, 2010) • Fast clique-finding (Strash, 2011) • + more……

  22. Impact: Publications • Approximately 60 peer-reviewed publications • across computer science, statistics, and social science • High visibility • Science, Butts, 2009 • Journal of the American Statistical Association, Schweinberger, in press • Annals of Applied Statistics, Gile and Handcock, 2010 • Journal of the ACM, da Fonseca and Mount, 2010 • Journal of Machine Learning Research, Asuncion, Smyth, et al., 2010 • Highly selective conferences • ACM SIGKDD 2010 (16% accept rate) • Neural Information Processing (NIPS) Conference 2009, 2011 (25% accepts) • IEEE Infocom 2010 (17.5% accepts) • Best paper and best poster awards • Cross-pollination • Exposing computer scientists to statistical and social networking ideas • Exposing social scientists and statisticians to computational modeling ideas

  23. Impact: Workshops and Invited Talks • 2010 Political Networks Conference • Workshop on Network Analysis • Presented and run by Butts and students Spiro, Fitzhugh, Almquist • Invited Talks: Conferences and Workshops • useR! 2010 Conference at NIST (Handcock, 2010) • 2010 Summer School on Social Networks (Butts) • Mining and Learning with Graphs Workshop (Smyth, 2010) • NSF/SFI Workshop on Statistical Methods for the Analysis of Network Data (Handcock, 2009) • International Workshop on Graph-Theoretic Methods in Computer Science (Eppstein, 2009) • Quantitative Methods in Social Science (QMSS) Seminar, Dublin (Almquist, 2010) • + many more… • Invited Talks: Universities • Stanford, UCLA, Georgia Tech, U Mass, Brown, etc.

  24. Impact: the Next Generation • Where students have gone… • Academia: University of Massachusetts, Karlsruhe, Utrecht • Research Labs/Industry: RAND, Google, Facebook • Students speaking at major conferences • Sunbelt International Social Network Meetings • Jasny, Spiro, Fitzhugh, Almquist, DuBois • American Sociological Association Meetings • Marcum, Jasny, Spiro, Fitzhugh, Almquist • 2010 ACM SIGKDD Conference (DuBois) • 2011 International Conference on Machine Learning (Vu) • 2011 Neural Information Processing Conference (Asuncion) • Only 20 talks selected for presentation out of 1400 submissions • Best paper awards or nominations (Spiro, Hummel, Almquist) • National fellowships: DuBois (NDSEG), Asuncion (NSF), Navaroli (NDSEG)

  25. …..and the Old Generation • Carter Butts • American Sociological Association, Leo A. Goodman award, 2010 • highest award to young methodological researchers in social science • David Eppstein • ACM Fellow, 2011 • Michael Goodrich • ACM Fellow, IEEE Fellow, 2009 • Mark Handcock • Fellow of the American Statistical Association, 2009 • Padhraic Smyth • ACM SIGKDD Innovation Award 2009 • AAAI Fellow 2010

  26. What Next? • Extending algorithmic advances into statistical modeling • Will allow us to scale existing algorithms to much larger data sets • Develop network models with richer representational power • Geographic data, temporal events, text data, actor covariates, heterogeneity, etc • Systematically evaluate and test different approaches • evaluate ability of models to predict over time, to impute missing values, etc • Apply these approaches to high visibility problems and data sets • e.g., online social interaction such as email, Facebook, Twitter, blogs • Make software publicly available

  27. SESSION 1: 9:20 External-Memory Network Analysis Algorithms for Naturally Sparse Graphs Michael Goodrich, Professor, Computer Science, UC Irvine 9:40 New Models for Exponential Family Random Networks Ian Fellows, PhD student, Statistics, UCLA 10:00 Set-Differencing Data Structures David Eppstein, Professor, Computer Science, UC Irvine 10:30 BREAK SESSION 2: 10:50 Hierarchical Statistical Models for Event-Based Social Network Data Chris DuBois, PhD student, Statistics, UC Irvine 11:10 Scalable Statistical Estimation Methods for Large Time-Varying Networks Dave Hunter, Professor, Statistics, Penn State 11:30 Large-Scale Social Network Analysis of Facebook Data Emma Spiro, PhD student, Sociology, UC Irvine

  28. 12:00 – 1:00 LUNCH • PIs + visitors at the University Club • Students + postdocs in 6011 • 1:00 to 2:30 POSTER SESSION with PhD students and postdoctoral fellows 2:30 – 3:40: SESSION 3 2:30 Order-Stable Parametrizations for ERGMs Carter Butts, Professor, Sociology, UC Irvine 2:50 ERGMs for Rank-Order Statistics • Pavel Krivitsky, Postdoctoral Fellow, Statistics, Penn State • 3:10 Estimating the Size of Hidden Populations based on Partially-Observed Network Data • Mark Handcock, Professor, Statistics, UCLA 3:40 WRAP-UP, CLOSING COMMENTS (+ BEVERAGE BREAK) 4:00 ADJOURN 5:00 ADJOURN

  29. Logistics • All talks and posters in this room • Wireless • Restrooms

  30. Additional Resources Project Web site: http://www.datalab.uci.edu/muri/ Slides and Posters from AHM: http://www.datalab.uci.edu/muri/june2011/ Publications: http://www.datalab.uci.edu/muri/publications.php Software: http://csde.washington.edu/statnet/ Data Sets: http://networkdata.ics.uci.edu/resources.php

  31. Questions?
