
Data Extraction



Presentation Transcript


  1. Data Extraction

  2. Road map • String Matching and Tree Matching • Multiple Alignments • Building DOM Trees • Extraction Given a List Page: Flat Data Records • Extraction Given a List Page: Nested Data Records • Extraction Given Multiple Pages • Summary CS511, Bing Liu, UIC

  3. Some useful algorithms • The key is finding the encoding template from a collection of encoded instances of the same type. • A natural way to do this is to detect repeated patterns in the HTML encoding strings. • String edit distance and tree edit distance are obvious techniques for the task. We describe these techniques.

  4. String edit distance • String edit distance: the most widely used string comparison technique. • The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of: • (1) change a letter, • (2) insert a letter, and • (3) delete a letter.

  5. String edit distance (definition)

  6. Dynamic programming

  7. An example • The edit distance matrix and back trace path • alignment
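
The definition and dynamic-programming slides above lost their formulas in extraction. As a minimal sketch of the standard recurrence, assuming unit cost for each of the three point mutations (the function name `edit_distance` is illustrative, not taken from the slides), the matrix can be filled as follows:

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of letter changes, insertions, and deletions turning s1 into s2."""
    m, n = len(s1), len(s2)
    # d[i][j] holds the edit distance between the prefixes s1[:i] and s2[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i  # delete all i letters of s1[:i]
    for j in range(1, n + 1):
        d[0][j] = j  # insert all j letters of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            change = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j - 1] + change,  # keep or change a letter
                d[i - 1][j] + 1,           # delete a letter of s1
                d[i][j - 1] + 1,           # insert a letter of s2
            )
    return d[m][n]


print(edit_distance("kitten", "sitting"))  # prints 3
```

The alignment mentioned on the example slide can be recovered by tracing back from `d[m][n]` through whichever neighbor produced each minimum; this sketch returns only the distance.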

  8. Tree Edit Distance • Tree edit distance between two trees A and B (labeled ordered rooted trees) is the cost associated with the minimum set of operations needed to transform A into B. • The set of operations used to define tree edit distance includes three operations: • node removal, • node insertion, and • node replacement. A cost is assigned to each of the operations.

  9. Definition

  10. Simple tree matching • In the general setting, • mapping can cross levels, e.g., node a in tree A and node a in tree B. • Replacements are also allowed, e.g., node b in A and node h in B. • We describe a restricted matching algorithm, called simple tree matching (STM), which has been shown to be quite effective for Web data extraction. • STM is a top-down algorithm. • Instead of computing the edit distance of two trees, it evaluates their similarity by producing the maximum matching through dynamic programming.

  11. Simple Tree Matching algorithm

  12. An example
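
The algorithm slide itself did not survive extraction. The following is a sketch of STM as it is commonly stated: roots with different labels match nothing, and otherwise the first-level subtrees are aligned in order by dynamic programming. The tree encoding as `(label, children)` tuples and the function name are assumptions made for illustration.

```python
def simple_tree_matching(a, b) -> int:
    """Size of the maximum top-down matching between ordered trees a and b.

    Each tree is a (label, children) tuple; children is a list of trees.
    """
    if a[0] != b[0]:
        return 0  # different root labels: no nodes can be matched
    kids_a, kids_b = a[1], b[1]
    m, n = len(kids_a), len(kids_b)
    # M[i][j] = best matching between the first i subtrees of a
    # and the first j subtrees of b, preserving sibling order.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(kids_a[i - 1], kids_b[j - 1])
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    return M[m][n] + 1  # +1 for the matched roots


# Hypothetical trees: the second has an extra child "c" between "a" and "b".
tree_a = ("p", [("a", []), ("b", [])])
tree_b = ("p", [("a", []), ("c", []), ("b", [])])
print(simple_tree_matching(tree_a, tree_b))  # prints 3
```

Because the recursion only descends when labels agree, matches never cross levels and never replace one label with another, which is the restriction described on slide 10.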

  13. Schema Alignment: Three Steps [BBR11] • Steps: Mediated Schema, Attribute Matching, Schema Mapping • Schema alignment: mediated schema + matching + mapping • Enables linkage and fusion to be semantically meaningful

  14. Schema Alignment: Three Steps • Steps: Mediated Schema, Attribute Matching, Schema Mapping • Schema alignment: mediated schema + matching + mapping • Enables domain-specific modeling

  15. Schema Alignment: Three Steps • Steps: Mediated Schema, Attribute Matching, Schema Mapping • Schema alignment: mediated schema + matching + mapping • Identifies correspondences between schema attributes

  16. Schema Alignment: Three Steps • Steps: Mediated Schema, Attribute Matching, Schema Mapping • Schema alignment: mediated schema + matching + mapping • Specifies transformation between records in different schemas

  17. Probabilistic Mediated Schemas [DDH08] • S1(name, hPhone, hAddr, oPhone, oAddr) • S4(name, pPh, pAddr) • Mediated schemas: automatically created by inspecting sources • Clustering of source attributes • Volume and variety of sources → uncertainty in accuracy of clustering

  18. Probabilistic Mediated Schemas [DDH08] • S1(name, hPhone, hAddr, oPhone, oAddr) • S4(name, pPh, pAddr) • Example P-mediated schema MS • M1({name}, {hPhone, pPh}, {oPhone}, {hAddr, pAddr}, {oAddr}) • M2({name}, {hPhone}, {pPh, oPhone}, {hAddr}, {pAddr, oAddr}) • M3({name}, {hPhone, pPh}, {oPhone}, {hAddr}, {pAddr}, {oAddr}) • M4({name}, {hPhone}, {pPh, oPhone}, {hAddr}, {pAddr}, {oAddr}) • MS = {(M1, 0.6), (M2, 0.4)}

  19. Probabilistic Mappings [DHY07, DDH08] • S1(name, hPhone, hAddr, oPhone, oAddr) • S4(name, pPh, pAddr) • Mapping between P-mediated schema and a source schema • Example mappings between M1 and S1 • G1({M1.n, name}, {M1.phP, hPhone}, {M1.phA, hAddr}, …) • G2({M1.n, name}, {M1.phP, oPhone}, {M1.phA, oAddr}, …) • G = {(G1, 0.6), (G2, 0.4)}

  20. Probabilistic Mappings • S1(name, hPhone, hAddr, oPhone, oAddr) • S4(name, pPh, pAddr) • Mapping between P-mediated schema and a source schema • Answering queries on P-mediated schema based on P-mappings • By-table semantics: one mapping for all tuples in a table • By-tuple semantics: different mappings are okay in a table

  21. Probabilistic Mappings: By Table Semantics • Consider query Q1: SELECT name, pPh, pAddr FROM MS • Result of Q1, under by-table semantics, in a possible world • G1({M1.n, name}, {M1.phP, hPhone}, {M1.phA, hAddr}, …)

  22. Probabilistic Mappings: By Table Semantics • Consider query Q1: SELECT name, pPh, pAddr FROM MS • Result of Q1, under by-table semantics, in a possible world • G2({M1.n, name}, {M1.phP, oPhone}, {M1.phA, oAddr}, …)

  23. Probabilistic Mappings: By Table Semantics • Now consider query Q2: SELECT pAddr FROM MS • Result of Q2, under by-table semantics, across all possible worlds

  24. Probabilistic Mappings: By Tuple Semantics • Consider query Q1: SELECT name, pPh, pAddr FROM MS • Result of Q1, under by-tuple semantics, in a possible world • G1({M1.n, name}, {M1.phP, hPhone}, {M1.phA, hAddr}, …) • G2({M1.n, name}, {M1.phP, oPhone}, {M1.phA, oAddr}, …)

  28. Probabilistic Mappings: By Tuple Semantics • Now consider query Q2: SELECT pAddr FROM MS • Result of Q2, under by-tuple semantics, across all possible worlds • Note the difference with the result of Q2, under by-table semantics
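
The result tables for Q2 were images and are not in the transcript. To make the difference concrete, here is a small sketch that answers a single-attribute projection such as Q2 under both semantics. The two source rows are invented for illustration, the p-mapping is reduced to the one attribute the query needs (pAddr maps to hAddr with probability 0.6 and to oAddr with probability 0.4, following G1 and G2), and tuples are assumed to choose their mappings independently under by-tuple semantics.

```python
from collections import defaultdict
from itertools import product

# Hypothetical rows of source S1; the slides' own data was not preserved.
S1 = [
    {"name": "Alice", "hAddr": "123 A Ave", "oAddr": "456 B Blvd"},
    {"name": "Bob", "hAddr": "456 B Blvd", "oAddr": "789 C Ct"},
]

# Simplified p-mapping: (mediated attribute -> source column, probability).
P_MAPPING = [({"pAddr": "hAddr"}, 0.6), ({"pAddr": "oAddr"}, 0.4)]


def by_table(rows, p_mapping, attr):
    """One mapping is applied to the whole table in each possible world."""
    probs = defaultdict(float)
    for mapping, p in p_mapping:
        answers = {row[mapping[attr]] for row in rows}
        for value in answers:
            probs[value] += p
    return dict(probs)


def by_tuple(rows, p_mapping, attr):
    """Each row picks its own mapping; a world is one choice per row."""
    probs = defaultdict(float)
    for choice in product(p_mapping, repeat=len(rows)):
        world_prob = 1.0
        answers = set()
        for row, (mapping, p) in zip(rows, choice):
            world_prob *= p
            answers.add(row[mapping[attr]])
        for value in answers:
            probs[value] += world_prob
    return dict(probs)


print(by_table(S1, P_MAPPING, "pAddr"))
print(by_tuple(S1, P_MAPPING, "pAddr"))
```

With this toy data, "456 B Blvd" is an answer in both by-table worlds, so its probability is 1.0, while under by-tuple semantics it is missing from the world where Alice uses G1 and Bob uses G2, giving about 0.76. The by-tuple enumeration is exponential in the number of rows and is only meant to show the definition.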

  29. Introduction to Hadoop • Hadoop Map/Reduce is • a Java-based software framework for easily writing applications • which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware • in a reliable, fault-tolerant manner.

  30. Hadoop Cluster Architecture • Client • Job submission node: JobTracker • HDFS master: NameNode • Slave nodes (three shown): each runs a TaskTracker and a DataNode From Jimmy Lin’s slides

  31. Hadoop HDFS

  32. Hadoop Cluster Rack Awareness

  33. Hadoop Development Cycle • 1. Scp data to cluster • 2. Move data into HDFS • 3. Develop code locally • 4. Submit MapReduce job • 4a. Go back to Step 3 • 5. Move data out of HDFS • 6. Scp data from cluster • (Figure: you and the Hadoop cluster) From Jimmy Lin’s slides

  34. Divide and Conquer • Partition: “Work” is split into w1, w2, w3 • Each part goes to a “worker”, producing r1, r2, r3 • Combine: r1, r2, r3 are merged into the “Result” From Jimmy Lin’s slides

  35. High-level MapReduce pipeline

  36. Detailed Hadoop MapReduce data flow

  37. Outline • Hadoop Basics • Case Study • Word Count • Pairwise Similarity • PageRank • K-Means Clustering • Matrix Factorization • Cluster Coefficient • Resource Entries to ML labs • Advanced Topics • Q&A

  38. Word Count with MapReduce • Doc 1: one fish, two fish • Doc 2: red fish, blue fish • Doc 3: cat in the hat • Map: emit values per term for each document • Shuffle and Sort: aggregate values by keys • Reduce: combine the values for each key (cat, blue, fish, hat, one, two, red) From Jimmy Lin’s slides
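
The three stages on this slide can be imitated in a single process. This is a sketch of the data flow only, not Hadoop code: the function names, the regular-expression tokenizer, and the in-memory shuffle are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit (word, 1) for every word occurrence in one document."""
    for word in re.findall(r"[a-z]+", text.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle and sort: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(word, counts):
    """Reduce: sum the values collected for one key."""
    return word, sum(counts)


def word_count(docs):
    pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


docs = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "cat in the hat"}
counts = word_count(docs)
print(counts["fish"])  # prints 4
```

In a real job the map calls run on different nodes and the framework performs the shuffle; only the map and reduce functions are written by the developer.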

  39. Outline • Hadoop Basics • Case Study • Word Count • Pairwise Similarity • PageRank • K-Means Clustering • Matrix Factorization • Cluster Coefficient • Resource Entries to ML labs • Advanced Topics • Q&A

  40. Calculating document pairwise similarity • Trivial solution • Load each vector O(N) times • Load each term O(df_t^2) times • Goal: a scalable and efficient solution for large collections From Jimmy Lin’s slides

  41. Better Solution • Each term contributes only if it appears in both documents • Load weights for each term once • Each term contributes O(df_t^2) partial scores From Jimmy Lin’s slides

  42. Decomposition • Each term contributes only if it appears in both documents • Load weights for each term once • Each term contributes O(df_t^2) partial scores • Stages labeled in the figure: map, reduce From Jimmy Lin’s slides
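
The decomposition can be sketched in a few lines, assuming similarity is the sum over shared terms of the product of the two documents' term weights. The postings below are a hypothetical toy input that uses raw term frequencies as weights; the function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def map_term(postings_list):
    """Map: one term emits a partial score for every pair of docs containing it."""
    for (d1, w1), (d2, w2) in combinations(sorted(postings_list), 2):
        yield (d1, d2), w1 * w2


def pairwise_similarity(postings):
    """Reduce: sum the partial scores emitted for each document pair."""
    scores = defaultdict(float)
    for postings_list in postings.values():
        for pair, partial in map_term(postings_list):
            scores[pair] += partial
    return dict(scores)


# Hypothetical postings: term -> [(doc_id, weight), ...]
postings = {
    "clinton": [(1, 2), (2, 1), (3, 1)],
    "obama": [(1, 1), (3, 1)],
    "cheney": [(2, 1)],
    "barack": [(3, 1)],
}
print(pairwise_similarity(postings))
```

A term that occurs in df_t documents emits df_t(df_t - 1)/2 pairs, which is where the O(df_t^2) partial scores per term come from, and why very long postings lists become the bottleneck.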

  43. Standard Indexing • (a) Map: tokenize each doc • (b) Shuffle: group values by terms • (c) Reduce: combine into posting lists From Jimmy Lin’s slides

  44. Inverted Indexing with MapReduce • Doc 1: one fish, two fish • Doc 2: red fish, blue fish • Doc 3: cat in the hat • Map (term → doc id, count): one → (1, 1), two → (1, 1), fish → (1, 2); red → (2, 1), blue → (2, 1), fish → (2, 2); cat → (3, 1), hat → (3, 1) • Shuffle and Sort: aggregate values by keys • Reduce: cat → (3, 1); blue → (2, 1); fish → (1, 2), (2, 2); hat → (3, 1); one → (1, 1); two → (1, 1); red → (2, 1) From Jimmy Lin’s slides
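
The same flow can be written as a short single-process sketch. It is not Hadoop code: the function names and the tokenizer are assumptions, and unlike the slide's figure it does not drop any words such as "in" or "the".

```python
import re
from collections import Counter, defaultdict


def map_doc(doc_id, text):
    """Map: emit (term, (doc_id, term_frequency)) for each distinct term."""
    term_freqs = Counter(re.findall(r"[a-z]+", text.lower()))
    for term, freq in term_freqs.items():
        yield term, (doc_id, freq)


def build_index(docs):
    """Shuffle by term, then reduce each group into a sorted postings list."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_doc(doc_id, text):
            grouped[term].append(posting)
    return {term: sorted(plist) for term, plist in grouped.items()}


docs = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "cat in the hat"}
index = build_index(docs)
print(index["fish"])  # prints [(1, 2), (2, 2)]
```

Sorting each postings list by document id in the reducer keeps the lists in the order that later merge-style processing expects.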

  45. Indexing (3-doc toy collection) • Documents: “Clinton Obama Clinton”, “Clinton Cheney”, “Clinton Barack Obama” • Indexing produces a postings list with term counts for each of the terms Clinton, Cheney, Barack, and Obama (figure) From Jimmy Lin’s slides

  46. Pairwise Similarity • (a) Generate pairs • (b) Group pairs • (c) Sum pairs • Figure: postings for Clinton, Cheney, Barack, and Obama • How to deal with the long list? From Jimmy Lin’s slides

  47. Record Linkage for Big Data Slides from Luna Dong’s VLDB Tutorial

  48. Record Linkage: Three Steps [EIV07, GM12] • Steps: Blocking, Pairwise Matching, Clustering • Record linkage: blocking + pairwise matching + clustering • Scalability, similarity, semantics

  49. Record Linkage: Three Steps • Steps: Blocking, Pairwise Matching, Clustering • Blocking: efficiently create small blocks of similar records • Ensures scalability

  50. Record Linkage: Three Steps • Steps: Blocking, Pairwise Matching, Clustering • Pairwise matching: compares all record pairs in a block • Computes similarity
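
The first two steps can be illustrated with a small sketch. Everything specific here is an assumption chosen for illustration and not taken from the tutorial: the blocking key (first letter of the name), the similarity function (Jaccard over name tokens), the 0.5 threshold, and the sample records.

```python
from collections import defaultdict
from itertools import combinations


def block_key(record):
    """Hypothetical blocking key: lowercased first letter of the name."""
    return record["name"][:1].lower()


def jaccard(a, b):
    """Jaccard similarity of the token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def link(records, threshold=0.5):
    # Blocking: only records sharing a key are ever compared.
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    # Pairwise matching: compare all record pairs inside each block.
    matches = []
    for block in blocks.values():
        for r1, r2 in combinations(block, 2):
            if jaccard(r1["name"], r2["name"]) >= threshold:
                matches.append((r1["id"], r2["id"]))
    return matches


records = [
    {"id": 1, "name": "John Smith"},
    {"id": 2, "name": "john smith jr"},
    {"id": 3, "name": "Jane Doe"},
    {"id": 4, "name": "Smith John"},
]
print(link(records))  # prints [(1, 2)]
```

Record 4 has the same tokens as record 1 but lands in a different block, so the pair is never compared. That is the trade-off blocking makes: fewer comparisons at the risk of missed matches when the key is chosen poorly.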
