DyDAn: Discovering Patterns in Dynamic Data for Homeland Security

DyDAn is an Institute for Discrete Sciences focused on developing technologies to analyze dynamic, nonstationary, massive datasets to find patterns and relationships. It will also nurture the future workforce through educational programs.


Presentation Transcript


  1. Background • DHS has established an Institute for Discrete Sciences (IDS). • Managed out of Lawrence Livermore National Laboratory. • DHS is establishing four “university affiliated centers” (UACs) around the country. • One of these will be a “coordinating” UAC. • The Rutgers-based team has been designated as a UAC and was asked to become a coordinating UAC. • Other centers: Univ. of Illinois Urbana-Champaign, Univ. of Southern California, Univ. of Pittsburgh • In addition to 6 formal partners, we have told DHS that NJ University Consortium institutions will be involved.

  2. What is Discrete Science? • Discrete Science deals with • Patterns • Arrangements • Assignments • Schedules • Discrete Science • Seeks patterns in large amounts of data • Analyzes connections between entities such as people and groups • Develops efficient ways to quickly spot changes in standard patterns

  3. Why DyDAn? • Homeland security requires inferences from massive flows of data, arriving continuously. • Buried in the data are quickly changing patterns. • DyDAn will develop novel technologies to find patterns & relationships in dynamic, nonstationary, massive datasets. • DyDAn will produce pioneering educational programs to nurture the homeland security workforce of the future.

  4. DyDAn Research • Information Management and Knowledge Discovery • Fundamental Topics in Discrete Mathematical Foundations • Two research themes: • Analysis of Large, Dynamic Multigraphs • Continuous, Distributed Monitoring of Dynamic, Heterogeneous Data

  5. DyDAn Research I: Analysis of Large, Dynamic Multigraphs • Need to understand interactions between entities: people, objects, groups • Interactions often modeled as graphs • Linking nodes (entities) with edges (connections) • Multiple relationships between entities suggest multigraphs • Adding new entities and new & changing connections suggests dynamic multigraphs. • Develop methods to represent, analyze, interrogate, & navigate dynamic multigraphs.
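
The idea of a dynamic multigraph on the slide above can be made concrete with a small sketch. This is only an illustration, not the project's representation: the class name `DynamicMultigraph`, its methods, and the choice of undirected edges labeled with a timestamp and a relation type are all assumptions made for the example.

```python
from collections import defaultdict


class DynamicMultigraph:
    """Toy undirected multigraph with timestamped, typed parallel edges.

    Hypothetical illustration only; names and layout are assumptions.
    """

    def __init__(self):
        # node -> neighbor -> list of (timestamp, relation) edge records
        self.adj = defaultdict(lambda: defaultdict(list))

    def add_edge(self, u, v, timestamp, relation):
        """Record one more connection; parallel edges are kept, not merged."""
        self.adj[u][v].append((timestamp, relation))
        self.adj[v][u].append((timestamp, relation))

    def edges_between(self, u, v, since=None):
        """All edges between u and v, optionally only those at or after `since`."""
        edges = self.adj.get(u, {}).get(v, [])
        return [e for e in edges if since is None or e[0] >= since]

    def neighbors(self, u, since=None):
        """Nodes connected to u by at least one edge in the time window."""
        return {v for v in self.adj.get(u, {}) if self.edges_between(u, v, since)}


g = DynamicMultigraph()
g.add_edge("a", "b", 1, "email")
g.add_edge("a", "b", 5, "call")   # second relationship between the same pair
g.add_edge("a", "c", 2, "email")
recent = g.neighbors("a", since=3)  # only "b" has an edge at time >= 3
```

Keeping every edge record, rather than collapsing parallel edges into a weight, is what lets the same structure answer both "how are these two entities related?" and "how has that changed over time?".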

  6. DyDAn Research I: Analysis of Large, Dynamic Multigraphs

  7. DyDAn Research II: Continuous, Distributed Monitoring of Dynamic, Heterogeneous Data • Need to understand massive amounts of data. • Data inherently distributed (multiple sources) • Data arrives rapidly – “continuously” • Seek anomalies, patterns, “emerging events” • Run continuous queries to monitor incoming data stream. • Data takes numerous forms; requires data mining methods that span the modalities.

  8. DyDAn Research Portfolio: Flexibility • 9 initial projects, 5 in Area I, 4 in Area II • Not all starting in year 1. • All build on previous work and additional funding from Rutgers. • Portfolio reviewed regularly with DHS, national lab partners, and other DHS centers; can readily adapt to newly identified needs.

  9. DyDAn Research Portfolio: Large Graphs Projects • Universal Information Graphs (initial emphasis) • Adding Semantics to and Interconnecting Semantic Graphs (initial emphasis) • Analyzing Large, Dynamic Multigraphs Arising from Blogs • Algorithms for Identifying Hidden Social Structures • Statistical and Graph-theoretical Approaches to Time-Varying Multigraphs (initial emphasis)

  10. DyDAn Research Portfolio: Dynamic Data Projects • Message Filtering and Entity Resolution • Continuous, Distributed Data Stream Modeling • Optimization and Data Analysis (initial emphasis) • Dynamic Similarity Search in Multi-Modal Data

  11. DyDAn Data • Emphasis on publicly available data. • How to acquire, publish, analyze, and store data in a private, secure way. • Privacy-preserving data analysis. • How to generate synthetic data sets that have the characteristics of real data but mask protected aspects. • Director of Data Analysis will work on all aspects of acquiring, sharing, publicizing, and analyzing data: privacy, legal, technical, etc.

  12. DyDAn Educational Programs • Great need to train people to work in homeland security. • Key DyDAn performers: record of integrating research and education from K-12 to postgraduate. • Integration of research and education: students in all research projects. • Integration of research and education: research themes into educational programs.

  13. DyDAn Educational Programs • Workshops, tutorials, shortcourses: most open to all • New courses, certificate programs, faculty training • Repository for information about homeland security courses nationally • New homeland security certificate programs: RPI, RU, TSU • Website to disseminate our models nationally • Program for national college faculty • Extensive program of “research experiences for undergraduates.” • Students from around the US in residence at DyDAn

  14. DyDAn Educational Programs • Internships/Visits • by students/faculty to national labs, corporate partner locations, and DyDAn. • by national lab, DHS, other UAC scientists to DyDAn • K-12 programs: • To build early awareness of educational and career opportunities in homeland security • Annual high school teacher “short course” in discrete math and homeland security

  15. Leadership as a Coordinating UAC • Building on extensive experience managing large, complex scientific & educational enterprises. • Based at DIMACS (Center for Discrete Mathematics and Theoretical Computer Science). • An original NSF “science and technology center” • 13 partner institutions (5 universities, 8 companies) • Large portfolio of research & educational programs with international scope

  16. A Resource for NJ • Connecting to the NJ Universities Homeland Security Research Consortium: Seek to involve all Consortium universities • Building on Relationships with State and Local Agencies • Advisory Committee: State and National Representatives • DyDAn Events open to NJ university, industry, and government participants and designed with their help. • Connecting NJ to DHS officials and efforts nationally.

  17. DyDAn Research • Information Management and Knowledge Discovery • Fundamental Topics in Discrete Mathematical Foundations • Two research themes: • Analysis of Large, Dynamic Multigraphs • Continuous, Distributed Monitoring of Dynamic, Heterogeneous Data

  18. Project: Universal Information Graphs • James Abello & Fred Roberts (Rutgers Univ.) • Kiran Chilakamarri (Texas Southern University) • Nate Dean (Texas State University-San Marcos)

  19. Overview and Connection to Problems of Homeland Security • A variety of different massive data sources are available to analysts: Web, Internet, Calls, Email, Transportation, … • Problem: Coordinate information from multiple sources, to identify “interesting” collaborative information networks. [Figure labels: Web, Market Baskets, Internet, Call Detail, Attack Graphs, Air Traffic]

  20. Overview and Connection to Problems of Homeland Security • Model each data source as a large multidigraph • Edges give information • Too much information to actually fuse all these multidigraphs into one. • Challenge: Fuse collection of multidigraphs in useful ways.

  21. Project: Adding Semantics to and Interconnecting Semantic Graphs Alex Borgida (Rutgers University) Lila Ghemri (Texas Southern University) Peter F. Patel-Schneider (Bell Labs Research)

  22. Overview and Connection to Problems of Homeland Security • Information of interest to DHS is often stored using “shallow” representations. • Much of the information is in English tags • Susceptible to ambiguity, incompleteness, etc. • These representations are nonetheless very useful • Alleviate such shallowly represented information by augmenting with rich ontologies that describe and prescribe how a domain works • can discover information inherent in shallow information • can expose inconsistencies in shallow information • Problem - reasoning with rich information is computationally expensive.

  23. Planned Work through DyDAn • Extend OWL Web Ontology Language, a powerful ontology language for use with shallow information • Extend and specialize theory of Distributed Description Logics (DDLs), designed to limit interactions to lessen computational load • Develop and extend a highly-optimized reasoner to improve its performance with large amounts of shallow information • Study how dynamic change interacts with reasoners

  24. Project: Analysis of Large, Dynamic Multigraphs Arising from Blogs • James Abello (Rutgers & Ask.com) • Graham Cormode (AT&T Labs – Research) • S. Muthukrishnan (Rutgers Univ.)

  25. Multigraphs in Security Applications • Intelligence data is well-modeled by large, evolving multigraphs • Nodes: entities Edges: connections • Many links between same pair of entities denote different interactions at different times • Relationships change (slowly, rapidly) over time. • Examples: • (User IDs, emails/telephone calls), • (Text reports/blogs/webpages, implicit/explicit links) • Our research: acquiring and analyzing multigraphs from different applications.

  26. Overview and Connection to Problems of Homeland Security • Blogs are an example of open source data • Large, highly interconnected source of timely posts on observations, experiences, events, politics, etc., from citizen observers (bloggers). • Chaotic source of information. What (mis)information is being propagated? • Challenge: find trustworthy sources. • Goal: develop techniques for labeling multigraphs in intelligence applications and apply them to blogs.

  27. Project: Algorithms for Identifying Hidden Social Structures in Virtual Communities Yuliy Baryshnikov (Bell Labs) Mark Goldberg (RPI) Malik Magdon-Ismail (RPI) William (Al) Wallace (RPI)

  28. Overview and Connection to Problems of Homeland Security • Prior to acting, perpetrators discuss and plan using a variety of communication media. • Challenge: Find hidden groups, coalitions, and leaders by non-semantic analysis of large communication networks. • Ideal result: Find a suspicious group based on its pre-event communication activity, before it acts. • Useful forensic result: Ex-post discovery of the relationship between the act and a communication burst.

  29. Project: Statistical and Graph-Theoretical Approaches to Time-Varying Multigraphs Colin Goodall, AT&T Labs – Research Robert Bell, AT&T Labs – Research David Madigan, Rutgers University

  30. Overview and Connection to Problems of Homeland Security • A COI (Community of Interest) is an effective summary of significant connections in a graph. • Use COI for very large scale analysis of a dynamic graph: • Stark change in COI indicates an anomaly • Has an entity changed its ID? • New cliques? • Goal: To analyze and apply automated anomaly detection to COIs of dynamic multigraphs in telecom, blogs, and intelligence data.
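
One way to read "stark change in COI indicates an anomaly" is to compare an entity's COI in two time windows. The sketch below assumes a deliberately simple definition that is not taken from the slides: the COI of a node is its top-k partners by number of edges, and change is measured by Jaccard distance. The function names `community_of_interest` and `coi_change` are hypothetical.

```python
from collections import Counter


def community_of_interest(edges, node, k=3):
    """Top-k partners of `node` by edge count (ties broken alphabetically).

    `edges` is an iterable of (u, v) pairs; repeated pairs are parallel edges.
    This top-k definition is an assumption made for illustration.
    """
    counts = Counter()
    for u, v in edges:
        if u == node:
            counts[v] += 1
        elif v == node:
            counts[u] += 1
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {partner for partner, _ in ranked[:k]}


def coi_change(old_edges, new_edges, node, k=3):
    """Jaccard distance between the COIs of two windows: 0 = same, 1 = disjoint."""
    old = community_of_interest(old_edges, node, k)
    new = community_of_interest(new_edges, node, k)
    union = old | new
    if not union:
        return 0.0
    return 1.0 - len(old & new) / len(union)


week1 = [("x", "a"), ("x", "a"), ("x", "b"), ("c", "x")]
week2 = [("x", "d"), ("x", "e"), ("x", "f")]
score = coi_change(week1, week2, "x")  # completely new partners, so the score is 1.0
```

A score near 1 for a previously stable entity would be a candidate anomaly; two different IDs with nearly identical COIs would hint that one entity has changed its ID.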

  31. Project: Message Filtering and Entity Resolution Endre Boros (Rutgers) Lila Ghemri (TSU) Tin Kam Ho (Bell Labs) Paul Kantor (Rutgers) David Madigan (Rutgers) Richard Mammone (Rutgers) Debasis Mitra (Bell Labs)

  32. Continuous Message Filtering and Entity Resolution in the Distributed Environment • Vast amounts of data flow into or through monitoring points • Chaff must be discarded; meaningful messages and patterns of messages must be detected • in real time • with limited communication among the processing and monitoring nodes • with minimal interruption of normal communication and privacy • and maximum effectiveness

  33. Overview of Research Problem • New research: • Work on automatically learning to identify topics, events, and actors in messages • Recognizing the same entity (an actor, a target, an organization) under different aliases • Multiple modeling and learning technologies • Multiple optimization and combination technologies • Operations: • 10 million messages a day • Billions of possible identifications • Thousands of potentially important messages, identifications, etc.

  34. Connection to Problems of Homeland Security • Millions or tens of millions of messages should be screened for patterns of interest • Actors often hide behind false identities • Reveal themselves by • Language • Style • Connections to other actors • Goal is to maximize screening effectiveness and minimize false positives, which disrupt individuals and commerce • Early, positive impact on detection of agents and organizations

  35. Project: Continuous, Distributed Data Stream Monitoring Moses Charikar (Princeton) Graham Cormode (AT&T Labs – Research) S. Muthukrishnan (Rutgers)

  36. Overview and Connection to Problems of Homeland Security • Data is massive, distributed, and evolving • inconvenient or impossible to collect together in one place • still need to monitor, identify patterns, correlate • Example: • monitoring streams of text: emails, blogs, newsfeeds, field reports. • identify patterns — profiles, clusters, outliers — that occur across multiple sites • Technical challenge: • must be accurate, avoid false positives. • optimize how (much) data is communicated between agents.
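
The trade-off between accuracy and how much data is communicated can be illustrated with a toy threshold monitor. This is a sketch under stated assumptions, not the project's protocol: each site counts events one at a time and reports its running count only after `delta` new events, and the coordinator raises an alarm only when the reported counts already reach the global threshold. The classes `Site` and `Coordinator` are hypothetical names.

```python
class Site:
    """A monitoring site that reports its count only every `delta` events."""

    def __init__(self, delta):
        self.delta = delta
        self.count = 0
        self.last_reported = 0

    def observe(self):
        """Record one local event; return the new count if a report is due."""
        self.count += 1
        if self.count - self.last_reported >= self.delta:
            self.last_reported = self.count
            return self.count
        return None


class Coordinator:
    """Tracks reported counts and alarms on a global threshold."""

    def __init__(self, num_sites, delta, threshold):
        self.sites = [Site(delta) for _ in range(num_sites)]
        self.reported = [0] * num_sites
        self.threshold = threshold
        self.messages = 0  # number of site-to-coordinator messages sent

    def event(self, site_id):
        update = self.sites[site_id].observe()
        if update is not None:
            self.reported[site_id] = update
            self.messages += 1

    def alarm(self):
        # Reported counts never exceed true counts, so an alarm raised here
        # cannot be a false positive; it may lag by under delta events per site.
        return sum(self.reported) >= self.threshold


coord = Coordinator(num_sites=3, delta=5, threshold=20)
for i in range(30):
    coord.event(i % 3)  # 30 events spread evenly over 3 sites
```

In this run the 30 events cost only 6 messages. A larger `delta` cuts communication further but delays detection, which is the tension the slide's last bullet points at.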

  37. Project: Optimization and Data Analysis • Alexandre d'Aspremont (Princeton) • Yuliy Baryshnikov (Bell Labs) • Savas Dayanik (Princeton) • Paul Kantor (Rutgers) • Kai Li (Princeton) • Warren Powell (Princeton) • Seyed Roosta (TSU)

  38. Overview and Connection to Problems of Homeland Security • Dynamic data raises optimization issues: • Has rate of messages sent by X changed? • Has flow of cash into/out of organization Y changed? • Has there been unexpected change in travel plans? • View as learning problem; use optimal learning strategies. • Research challenges: • Optimal detection of changes in signals • Optimal recursive estimation • Rapid classification/pattern identification
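
A question such as "has the rate of messages sent by X changed?" is a change-detection problem. The slides do not name a method; as one standard example, the sketch below uses a one-sided CUSUM statistic. The function name `cusum_alarm` and the parameter values are assumptions chosen for illustration.

```python
def cusum_alarm(samples, baseline, drift, threshold):
    """Return the index of the first sample at which an upward shift is flagged.

    One-sided CUSUM: accumulate how far each sample exceeds baseline + drift,
    resetting to zero whenever the running sum would go negative.
    Returns None if the statistic never exceeds `threshold`.
    """
    s = 0.0
    for i, x in enumerate(samples):
        s = max(0.0, s + (x - baseline - drift))
        if s > threshold:
            return i
    return None


# Daily message counts for a hypothetical sender: steady near 10, then a jump.
daily_counts = [10, 11, 9, 10, 10, 15, 16, 15, 17]
first_alarm = cusum_alarm(daily_counts, baseline=10, drift=1, threshold=8)
# first_alarm == 6: the statistic reaches 4 at index 5 and 9 at index 6
```

The `drift` parameter absorbs ordinary fluctuation and `threshold` sets how much accumulated evidence is required, so together they trade detection delay against false alarms.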

  39. Project: Dynamic Similarity Search in Multi-Modal Data Moses Charikar, Perry Cook, Kai Li, and Olga Troyanskaya (Princeton) Ken Clarkson, Tin Kam Ho, and Haobo Ren (Bell Labs)

  40. Overview and Connection to Problems of Homeland Security • Data arising in homeland security comes from many modalities • Often such data are sensor data (audio, images, video, etc.), which are noisy and require similarity matching and similarity search • Feature extraction is difficult, and the resulting features are high-dimensional • Multi-modal data of interest are massive • Current content-based similarity search and classification are limited to small scale • “Curse of dimensionality”

  41. Overview and Connection to Problems of Homeland Security • How to build similarity search systems for multi-modal data is not well understood • How to manage and search at scale • How to integrate annotations/attributes based search with content-based search
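
To make the similarity-search problem concrete, here is a sketch of one well-known way to compress high-dimensional feature vectors into short bit signatures: random-hyperplane hashing. The slides do not say which techniques the project uses, so this is only an example of the general idea. The helper names `make_hyperplanes`, `signature`, and `hamming` are hypothetical.

```python
import random


def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random Gaussian hyperplanes in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(p_i * x_i for p_i, x_i in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def hamming(a, b):
    """Number of positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


planes = make_hyperplanes(dim=4, bits=16)
v = [1.0, 2.0, -0.5, 3.0]
same_direction = [2 * x for x in v]   # same angle, so the same signature
opposite = [-x for x in v]            # opposite angle, so every bit flips
d_same = hamming(signature(v, planes), signature(same_direction, planes))
d_opposite = hamming(signature(v, planes), signature(opposite, planes))
```

Vectors separated by a small angle tend to agree on most bits, so Hamming distance between signatures serves as a cheap proxy for angular similarity, and signatures can be bucketed to avoid comparing every pair at scale.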

  42. We are looking forward to collaborating with the DHS Institute for Discrete Sciences and to involving the NJ homeland security community in the new center.
