
Daniel M. Dunlavy, Timothy M. Shead, Patricia J. Crossno, Sandia National Laboratories

ParaText™: Leveraging Scalable Scientific Computing Capabilities for Large-Scale Text Analysis and Visualization. Daniel M. Dunlavy, Timothy M. Shead, Patricia J. Crossno, Sandia National Laboratories. SIAM Conference on Computational Science and Engineering, March 2, 2009.


Presentation Transcript


  1. ParaText™: Leveraging Scalable Scientific Computing Capabilities for Large-Scale Text Analysis and Visualization. Daniel M. Dunlavy, Timothy M. Shead, Patricia J. Crossno, Sandia National Laboratories. SIAM Conference on Computational Science and Engineering, March 2, 2009. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND2009-2391C

  2. Motivation • Database, unstructured text: gigabytes to terabytes • Data analysts: few and overworked • Processing and analysis, scalable: ParaText™ • Visualization, scalable: VTK / ParaView / Titan

  3. Text Analysis Pipeline • Ingestion: file readers (ASCII, UTF-8, XML, PDF, ...) • Pre-processing: tokenization, stemming, part-of-speech tagging, named entity extraction, sentence boundaries • Transformation: data modeling, dimensionality reduction, feature weighting, feature extraction/selection • Analysis: information retrieval, clustering, summarization, classification, pattern recognition, statistics • Post-processing: visualization, filtering, summary statistics • Archiving: database, file, web site

  4. Vector Space Model • Vector Space Model for Text • Terms (features): $t_1, \dots, t_m$ • Documents (objects): $d_1, \dots, d_n$ • Term-Document Matrix: $A \in \mathbb{R}^{m \times n}$ • $a_{ij}$: measure of importance of term $i$ in document $j$ • Term Examples • Sentence: “Danny re-sent $1.” • Words: danny, sent, re [# chars?], $ [sym?], 1 [#?], re-sent [-?] • n-grams (3): dan, ann, nny, ny_, _re, re-, e-s, sen, ent, nt_, … • Named entities (people, orgs, money, etc.): danny, $1 • Document Examples • Documents, paragraphs, sentences, fixed-size chunks [G. Salton, A. Wong, and C. S. Yang (1975), "A Vector Space Model for Automatic Indexing," Comm. ACM.]
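The term choices on this slide can be sketched in a few lines of Python. This is a minimal illustration, not ParaText™'s tokenizer: the function names `word_terms` and `char_ngrams`, the lowercasing, the treatment of digits and symbols as separators, and the use of `_` for spaces are all assumptions made for the example.

```python
import re


def word_terms(sentence: str) -> list[str]:
    """Lowercase the sentence and keep runs of letters as word terms.

    Assumed rule: digits, symbols, and hyphens act as separators, which is
    only one of the possible answers to the questions the slide raises.
    """
    return re.findall(r"[a-z]+", sentence.lower())


def char_ngrams(sentence: str, n: int = 3) -> list[str]:
    """Slide a window of n characters over the lowercased sentence.

    Spaces are written as "_" (an assumption mirroring the slide's notation).
    """
    text = sentence.lower().replace(" ", "_")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sentence = "Danny re-sent $1."
words = word_terms(sentence)
trigrams = char_ngrams(sentence, 3)
print(words)
print(trigrams[:4])
```

A plain sliding window also emits n-grams that span word boundaries, so its output is a superset of the abbreviated list shown on the slide.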

  5. Latent Semantic Analysis (LSA) • SVD: $A = T \Sigma D^T$ ($T$: terms × concepts, $\Sigma$: singular values, $D^T$: concepts × documents) • Boolean query (query as new “doc”): $q^T A$ • Truncated SVD: $A_k = T_k \Sigma_k D_k^T$ • LSA query: $q^T A_k$ [Figure: term-document matrix with rows $t_1, \dots, t_m$ and columns $d_1, \dots, d_n$, and its truncated SVD] [Deerwester, S. C., et al. (1990), “Indexing by latent semantic analysis.” J. Am. Soc. Inform. Sci.]

  6. ParaText™: Scalable Text Analysis • ParaText™ • Scalable client-server system • Leverages existing capabilities • ParaView, VTK, Trilinos • Latent semantic analysis • Term-document matrix analysis • Document similarities • Term (concept) similarities • Algebraic Engine • General object-feature analysis • General Use • OverView: open source information visualization • Timeline Treemap Browser: temporal information visualization • LSAView: text algorithm analysis • LDRDView: funding portfolio analysis • ParaSpace: simulation data analysis

  7. Leveraging VTK / ParaView • Useful capabilities already in VTK / ParaView • Pipeline architecture • Components for reading, processing, and visualizing (scientific) data sets • Distributed-memory parallel processing • Streaming data • ParaView client/server architecture • Informatics capabilities recently added by Titan • Data structures: tables, trees, graphs • Data storage: delimited text files, common graph formats, SQL databases • Algorithms: graph, statistics, Matlab and R integration • Visualization: informatics-oriented paradigms • Recent Titan features driven by ParaText™ • Dense & sparse N-way arrays for algebraic / tensor analysis • Unicode text storage, manipulation, and rendering • Text ingestion, LSA, creating/storing algebraic models, performing algebraic analyses

  8. Leveraging Trilinos • Useful capabilities already in Trilinos • Sparse matrix distributed data structure (Epetra) • SVD (Anasazi) • Incremental SVD (RBGen) • Load balancing (Isorropia/Zoltan) • Remaining issues for ParaText™ • Data layout: different needs throughout pipeline • Data storage: Epetra only provides doubles (Tpetra) • Data storage: compressed row storage • Target platforms: Linux focus (Autotools → CMake)

  9. [Architecture diagram: an OverView ParaText™ client talks over XMLHTTP to a master ParaText™ Server (PTS); PTS processes P0, P1, …, Pk on an HPC resource (cluster, multicore server, etc.) each run a fully scalable Reader → Parser → Matrix → SVD pipeline; database servers hold the Artifact DB and the Matrices DB.]

  10. ParaText™ Scaling Examples • Types of scaling • Strong: fixed data size, increasing # of processors • Weak: increasing data size, increasing # of processors • Fixed resource: increasing data size, fixed # of processors • Data • Subset of Spock Challenge Data (~10 GB HTML documents) • 64 MB, 3318 documents, 281904 unique terms • Hardware • 16 Dell 1850 Servers, Dual 3.6 GHz EM64T, 6 GB RAM
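The scaling types listed on this slide are usually summarized with speedup and efficiency figures. The sketch below uses the standard textbook definitions; the function names and the timing values are hypothetical and are not measurements from the talk.

```python
def strong_scaling(times: dict[int, float]) -> dict[int, tuple[float, float]]:
    """Return {p: (speedup, efficiency)} for a fixed problem size.

    speedup(p) = T(base) / T(p) and efficiency(p) = speedup(p) * base / p,
    where base is the smallest processor count that was measured.
    """
    base = min(times)
    result = {}
    for p, t in sorted(times.items()):
        speedup = times[base] / t
        result[p] = (speedup, speedup * base / p)
    return result


def weak_scaling_efficiency(times: dict[int, float]) -> dict[int, float]:
    """Return {p: T(base) / T(p)} when the data size grows with p.

    Ideal weak scaling keeps the run time constant, giving efficiency 1.0.
    """
    base = min(times)
    return {p: times[base] / t for p, t in sorted(times.items())}


# Hypothetical wall-clock times in seconds, keyed by processor count.
strong_times = {1: 100.0, 2: 55.0, 4: 30.0, 8: 20.0}
weak_times = {1: 100.0, 2: 105.0, 4: 112.0, 8: 125.0}

strong = strong_scaling(strong_times)
weak = weak_scaling_efficiency(weak_times)
```

Fixed-resource scaling needs no separate formula: with the processor count held constant, one simply tracks run time against data size.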

  11. Strong Scaling

  12. Weak Scaling

  13. Weak Scaling by Filter

  14. Fixed Resource Scaling

  15. <DOC><DOCNO><s docid="APW19990519.0113" num="1" stype="-1">APW19990519.0113</s></DOCNO><DATE_TIME><P> <s docid="APW19990519.0113" num="2" stype="-1"> 1999-05-19 21:11:17</s> </P></DATE_TIME><BODY><CATEGORY><s docid="APW19990519.0113" num="3" stype="-1"> usa</s></CATEGORY><HEADLINE><P> <s docid="APW19990519.0113" num="4" stype="0"> Pulses May Ease Schizophrenic Voices</s> </P></HEADLINE><TEXT><P> <s docid="APW19990519.0113" num="5" stype="1"> WASHINGTON (AP) Schizophrenia patients whose medication couldn't stop the imaginary voices in their heads gained some relief after researchers repeatedly sent a magnetic field into a small area of their brains.</s> </P><P><s docid="APW19990519.0113" num="6" stype="1"> About half of 12 patients studied said their hallucinations became much less severe after the treatment, which feels like ``having a woodpecker knock on your head'' once a second for up to 16 minutes, said researcher Dr. Ralph Hoffman.</s> <s docid="APW19990519.0113" num="7" stype="1">The voices stopped completely in three of these patients.</s> </P><P><s docid="APW19990519.0113" num="8" stype="1"> The effect lasted for up to a few days for most participants, and one man reported that it lasted seven weeks after being treated daily for four days.</s> </P><P><s docid="APW19990519.0113" num="9" stype="1"> Hoffman stressed that the study is only preliminary and can't prove that the treatment would be useful.</s> <s docid="APW19990519.0113" num="10" stype="1"> ``We need to do much more research on this,'' he said in an interview.</s> </P><P><s docid="APW19990519.0113" num="11" stype="1"> Hoffman, deputy medical director of the Yale Psychiatric Institute, is scheduled to present the work Thursday at the annual meeting of the American Psychiatric Association.</s> </P><P><s docid="APW19990519.0113" num="12" stype="1"> Not all people with schizophrenia hear voices, and of those who do, Hoffman estimated that maybe 25 percent can't control them with medications even when other disease symptoms abate.</s> <s docid="APW19990519.0113" num="13" stype="1"> So the work could pay off for ``a small but very ill group of patients,'' he said.</s> </P><P><s docid="APW19990519.0113" num="14" stype="1"> The treatment is called transcranial magnetic stimulation, or TMS.</s> <s docid="APW19990519.0113" num="15" stype="1"> While past research indicates it might be helpful in lifting depression, it hasn't been studied much in schizophrenia.</s> </P><P><s docid="APW19990519.0113" num="16" stype="1"> In TMS, an electromagnetic coil is placed on the scalp and current is turned on and off to create a pulsing magnetic field that reaches into a small area of the brain.</s> <s docid="APW19990519.0113" num="17" stype="1"> The goal is to make brain cells underneath the coil fire messages to adjoining cells.</s> </P><P><s docid="APW19990519.0113" num="18" stype="1"> The procedure is much different from electroconvulsive therapy, called ECT, which applies pulses of electricity rather than a magnetic field to the brain.</s> <s docid="APW19990519.0113" num="19" stype="1"> Unlike TMS, ECT creates a brief seizure and is performed under general anesthesia.</s> <s docid="APW19990519.0113" num="20" stype="1"> ECT is used most often for treating severe depression.</s> </P><P><s docid="APW19990519.0113" num="21" stype="1"> In TMS, the magnetic pulses are thought to calm the affected part of the brain if they're given as slowly as once per second, Hoffman said.</s> <s docid="APW19990519.0113" num="22" stype="1"> He and colleagues targeted an area involved in understanding speech, above and behind the left ear, on the theory that hallucinated voices come from overactivity there.</s> </P><P><s docid="APW19990519.0113" num="23" stype="1"> The treatment can make scalp muscles muscle contract, leading to the woodpecker feeling, he said, but patients could tolerate it.</s> <s docid="APW19990519.0113" num="24" stype="1">Headache was the most common side effect, and there was no sign that the treatment affected the ability to understand speech, he said.</s> </P><P><s docid="APW19990519.0113" num="25" stype="1"> To make sure the study results didn't reflect just the psychological boost of getting a treatment, researchers gave sham and real treatments to each study participant and studied the difference in how patients responded.</s> <s docid="APW19990519.0113" num="26" starting with

  16. LSAView: Algorithm Analysis / Development • LSAView • Analysis and exploration of impact of informatics algorithms on end-user visual analysis of data • Aids in discovery process of optimal algorithm parameters for given data and tasks • Features • Side-by-side comparison of visualizations for two sets of parameters • Small multiple view for analyzing 2+ parameter sets simultaneously • Linked document, graph, matrix, and tree data views • Interactive, zoomable, hierarchical matrix and matrix-difference views • Statistical inference tests used to highlight novel parameter impact • Used in developing and understanding ParaText™ algorithms

  17. LSAView

  18. LSAView Impact • Document similarities: • Inner product view: • Scaled inner product view: • What is the best weighting of singular values for document similarity graph generation? [Figure panels: original scaling, no scaling, inverse sqrt, inverse] [Leisure Studies of America Data: 97 documents, 335 terms]
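One way to experiment with the weighting question on this slide is to raise the retained singular values to a power before forming document-by-document inner products. The sketch below is an assumption about how the four panel labels could map to exponents (1, 0, -1/2, -1); the slide's own formulas are not preserved in the transcript, and the function name, the example matrix, and the choice `k = 2` are hypothetical.

```python
import numpy as np


def weighted_similarities(A: np.ndarray, k: int, alpha: float) -> np.ndarray:
    """Document-by-document similarities from a rank-k SVD of A.

    Document coordinates are D_k scaled by s_k**alpha, so alpha = 1 keeps the
    singular values, alpha = 0 drops them, and negative alpha inverts them.
    """
    _, s, Dt = np.linalg.svd(A, full_matrices=False)
    coords = Dt[:k].T * s[:k] ** alpha   # shape: documents x concepts
    return coords @ coords.T


# Hypothetical 3-term x 4-document count matrix.
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 2.0],
              [1.0, 1.0, 0.0, 1.0]])

# Assumed mapping from the panel labels to exponents.
schemes = {"original scaling": 1.0, "no scaling": 0.0,
           "inverse sqrt": -0.5, "inverse": -1.0}
views = {name: weighted_similarities(A, k=2, alpha=a)
         for name, a in schemes.items()}
```

With `alpha = 1` the result equals the inner products of the columns of the rank-k approximation, which gives a convenient sanity check for the other settings.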

  19. Summary and Future Work • ParaText™ • Leverages VTK / ParaView / Trilinos (scientific computing) • Scalable infoviz / text analysis pipeline • General data object-feature analysis capability • More work ahead … • Scaling experiments: bigger data, layout constraints, etc. • Extensions: load balancing, incremental SVD, etc. • Other applications: clustering, summarization, classification, etc. • Relevance feedback: automatically incorporating user feedback to improve analysis (e.g., priors, metric learning)

  20. Thank You • ParaText™: Leveraging Scalable Scientific Computing Capabilities for Large-Scale Text Analysis and Visualization • Danny Dunlavy, dmdunla@sandia.gov, http://www.cs.sandia.gov/~dmdunla

  21. Backup Slides

  22. Feature Weighting • Term-Document Matrix Scaling:
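The scaling formula on this slide did not survive in the transcript. As a stand-in, the sketch below applies one common scheme: a local weight (raw term frequency) times a global weight (inverse document frequency), followed by normalizing each document column to unit length. The choice of tf-idf and the function name `weight_matrix` are assumptions, not necessarily the weighting used in ParaText™.

```python
import numpy as np


def weight_matrix(counts: np.ndarray) -> np.ndarray:
    """Weight a term-document count matrix as local * global * normalization.

    local:         raw term frequency (the counts themselves)
    global:        inverse document frequency log(n / df_i)  (assumed scheme)
    normalization: scale every document column to unit Euclidean length
    """
    n_docs = counts.shape[1]
    df = np.count_nonzero(counts, axis=1)        # documents containing each term
    idf = np.log(n_docs / np.maximum(df, 1))     # guard against unused terms
    weighted = counts * idf[:, None]
    norms = np.linalg.norm(weighted, axis=0)
    norms[norms == 0] = 1.0                      # leave all-zero columns unchanged
    return weighted / norms


# Hypothetical counts: 3 terms (rows) by 4 documents (columns).
counts = np.array([[2.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 2.0],
                   [1.0, 1.0, 0.0, 1.0]])
W = weight_matrix(counts)
```

Terms that occur in every document receive a global weight of zero under this scheme, which is one reason other global weights (for example, entropy-based ones) are sometimes preferred.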

  23. LSA Example 1 • d1: Hurricane. A hurricane is a catastrophe. • d2: An example of a catastrophe is a hurricane. • d3: An earthquake is bad. • d4: Earthquake. An earthquake is a catastrophe. • Remove stopwords to get term counts (rows: hurricane, earthquake, catastrophe; columns: d1 to d4): $\begin{pmatrix} 2 & 1 & 0 & 0 \\ 0 & 0 & 1 & 2 \\ 1 & 1 & 0 & 1 \end{pmatrix}$ • Normalization: $A = \begin{pmatrix} .89 & .71 & 0 & 0 \\ 0 & 0 & 1 & .89 \\ .45 & .71 & 0 & .45 \end{pmatrix}$ • Query: $q = (1, 0, 0)^T$ (hurricane), $q^T A = (.89, .71, 0, 0)$ • Rank-2 approximation: $A_2 = \begin{pmatrix} .78 & .78 & -.11 & .11 \\ -.03 & .02 & .96 & .92 \\ .59 & .60 & .15 & .30 \end{pmatrix}$, $q^T A_2 = (.78, .78, -.11, .11)$ • Only the rank-2 approximation captures the link to doc 4
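The hurricane/earthquake example can be reproduced with NumPy as a check on the numbers. This is a minimal sketch: the count matrix is read off the four example documents after stopword removal, and the variable names (`T`, `s`, `Dt`, `A2`) are chosen for this illustration rather than taken from ParaText™.

```python
import numpy as np

# Term counts after stopword removal.
# Rows: hurricane, earthquake, catastrophe. Columns: d1..d4.
counts = np.array([[2.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 2.0],
                   [1.0, 1.0, 0.0, 1.0]])

# Normalize each document column to unit Euclidean length.
A = counts / np.linalg.norm(counts, axis=0)

# Rank-2 truncated SVD: keep the two largest singular triplets.
T, s, Dt = np.linalg.svd(A, full_matrices=False)
k = 2
A2 = (T[:, :k] * s[:k]) @ Dt[:k]

q = np.array([1.0, 0.0, 0.0])   # query containing only "hurricane"
exact_scores = q @ A            # term-matching scores
lsa_scores = q @ A2             # LSA scores from the rank-2 model

print(np.round(exact_scores, 2))
print(np.round(lsa_scores, 2))
```

Up to rounding, the two score vectors should agree with the slide's values for $q^T A$ and $q^T A_2$: d4 scores zero under exact matching because it never mentions "hurricane", but picks up a small positive score in the rank-2 model through the shared term "catastrophe".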

  24. LSA Example 2 • policy • planning • politics • tomlinson • 1986 • Sport in Society: Policy, Politics and Culture, ed A. Tomlinson (1990) • Policy and Politics in Sport, PE and Leisure eds S. Fleming, M. Talbot and A. Tomlinson (1995) • Policy and Planning (II), ed J. Wilkinson (1986) • Policy and Planning (I), ed J. Wilkinson (1986) • Leisure: Politics, Planning and People, ed A. Tomlinson (1985) • parker • lifestyles • 1989 • part • Work, Leisure and Lifestyles (Part 2), ed S. R. Parker (1989) • Work, Leisure and Lifestyles (Part 1), ed S. R. Parker (1989) [Leisure Studies of America Data: 97 documents, 335 terms]

  25. Modeling Text • Model document meaning as a linear equation • $\text{meaning}(\text{document}) = \sum_j \text{meaning}(\text{term}_j)$ • Induces a high dimensional vector space model • Create term-document (occurrence) matrix • Examples of “terms”: • Words • Word stems, lemmas • Entities (person, organization, location, etc.) • N-grams (characters or words) [Figure: term-document matrix with rows $t_1, \dots, t_m$ (terms) and columns $d_1, \dots, d_n$ (documents)]
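Building the term-document occurrence matrix described here takes only a few lines. The sketch below assumes the documents have already been tokenized into terms; the function name `term_document_matrix` and the sample token lists are hypothetical.

```python
from collections import Counter


def term_document_matrix(docs: list[list[str]]) -> tuple[list[str], list[list[int]]]:
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    terms = sorted({term for doc in docs for term in doc})
    counts = [Counter(doc) for doc in docs]
    # A Counter returns 0 for terms that do not occur in a document.
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix


# Hypothetical, already tokenized documents with stopwords removed.
docs = [
    ["hurricane", "hurricane", "catastrophe"],
    ["catastrophe", "hurricane"],
    ["earthquake"],
    ["earthquake", "earthquake", "catastrophe"],
]
terms, matrix = term_document_matrix(docs)
```

A dense list of lists is fine for a toy example; at the scale discussed in the talk the same counts would normally be kept in a sparse format.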

  26. Latent Semantic Analysis • Conceptual searching • rank $k$ ↑: more exact data similarities • rank $k$ ↓: more conceptual data similarities • Compute larger rank, then use smaller rank [Figure: terms × concepts factors for increasing $k$; $k = 24$ is more exact, $k = 6$ is more conceptual]

  27. ParaText™ Operations • Document parsing, matrix creation and weighting • SVD: $A = T \Sigma D^T$ • Truncated: $A_k = T_k \Sigma_k D_k^T$ • Query scores (query as new “doc”): $q^T A$ • LSA Ranking: $q^T A_k$ • Document similarities: • Term similarities: [Figure: SVD factors $T$ (terms × concepts), $\Sigma$ (singular values), $D^T$ (concepts × documents)]

  28. Document Similarity Graphs • Document similarity matrix • Document similarity graph • Each document (or term, entity, etc.) is a vertex • Each row defines an edge [Figure: the document similarity matrix, formed from the documents × concepts and singular-value factors, is thresholded and stored in sparse coordinate format]
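The step from similarity matrix to graph can be sketched as thresholding followed by storage in coordinate (row, column, value) form, where each stored row is one edge. The function name `similarity_edges`, the threshold value, and the sample matrix below are hypothetical.

```python
import numpy as np


def similarity_edges(S: np.ndarray, threshold: float) -> list[tuple[int, int, float]]:
    """Return one (i, j, weight) row per pair i < j with S[i, j] >= threshold."""
    i, j = np.triu_indices_from(S, k=1)   # upper triangle, no self-loops
    weights = S[i, j]
    keep = weights >= threshold
    return [(int(a), int(b), float(w))
            for a, b, w in zip(i[keep], j[keep], weights[keep])]


# Hypothetical symmetric document similarity matrix for four documents.
S = np.array([[1.0, 0.9, 0.0, 0.2],
              [0.9, 1.0, 0.1, 0.3],
              [0.0, 0.1, 1.0, 0.8],
              [0.2, 0.3, 0.8, 1.0]])

edges = similarity_edges(S, threshold=0.5)
print(edges)
```

Only the upper triangle is scanned because the similarity matrix is symmetric, so each undirected edge is emitted once.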

  29. Graph Similarities • Statistics on edges • One graph: one-sample t statistic • Two graphs: two-sample t statistic [Figure: edges from graph 1 compared with edges from graph 2]
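The edge statistics named on this slide can be computed with the Python standard library. This is a sketch under assumptions: the talk does not say which two-sample variant is used, so the unequal-variance (Welch) form is shown, and the edge weights and the reference value `mu0` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def one_sample_t(x: list[float], mu0: float = 0.0) -> float:
    """One-sample t statistic of the edge weights x against a reference mean mu0."""
    return (mean(x) - mu0) / sqrt(variance(x) / len(x))


def two_sample_t(x: list[float], y: list[float]) -> float:
    """Welch's two-sample t statistic (unequal variances assumed)."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


# Hypothetical edge weights from two document similarity graphs.
graph1_edges = [0.9, 0.8, 0.7, 0.6]
graph2_edges = [0.5, 0.4, 0.3, 0.2]

t_one = one_sample_t(graph1_edges, mu0=0.5)
t_two = two_sample_t(graph1_edges, graph2_edges)
```

`statistics.variance` is the sample variance (divisor n - 1), which is the form the t statistic expects.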
