
Programming for Engineers in Python

Lecture 4: Data Analysis. Autumn 2011-12.


Presentation Transcript


  1. Programming for Engineers in Python • Lecture 4: Data Analysis • Autumn 2011-12

  2. Lecture 3: Highlights • Simulation: power lines and rare diseases • Plan before coding • Using Modules • Import • Constants

  3. Tuples • Fixed size • Immutable (like strings) • What are they good for (compared to lists)? • Simpler (“lightweight”) • Stuff multiple things into a single container • Immutable (e.g., records in a database)
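
The tuple properties above can be sketched as follows. This is a minimal illustration in Python 3, not code from the lecture; the record contents and the coordinates are made up.

```python
# A tuple packs several values into one fixed-size, immutable container.
record = ("Alice's Adventures in Wonderland", 1865, "Lewis Carroll")  # made-up record

# Unpacking assigns each item to its own name.
title, year, author = record

# Tuples cannot be modified in place: item assignment raises TypeError.
try:
    record[1] = 1866
except TypeError as err:
    error_message = str(err)

# Because they are immutable, tuples of hashable items can serve as dictionary keys.
locations = {(32.1, 34.8): "Tel Aviv"}  # made-up coordinates
```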

  4. Dictionaries (Hash Tables) • Key-value mapping • Fast! • Usage: • Database • Dictionary • Phone book

  5. Dictionaries (Cont.)

  6. Dictionaries (Cont.)
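
The code from the dictionary slides is not part of this transcript. As a stand-in, here is a small sketch of the basic operations using the phone-book usage mentioned above; the names and numbers are invented.

```python
# A phone book as a key -> value mapping (all entries are made up).
phone_book = {"Dana": "03-5551234", "Yossi": "04-5559876"}

phone_book["Noa"] = "02-5550000"            # insert a new key
dana = phone_book["Dana"]                   # look up by key
missing = phone_book.get("Avi", "unknown")  # look up with a default, no KeyError
has_yossi = "Yossi" in phone_book           # membership test on keys

del phone_book["Yossi"]                     # remove a key
names = sorted(phone_book)                  # iterating a dict yields its keys
```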

  7. Dict – Initiate Dictionary from a List
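
The slide's own example is missing from the transcript. One common way to build a dictionary from a list, assuming the list holds (key, value) pairs or can be zipped into them, looks like this:

```python
# dict() accepts a list of (key, value) pairs.
pairs = [("one", 1), ("two", 2), ("three", 3)]
numbers = dict(pairs)

# Two parallel sequences can be paired up with zip() first.
words = ["to", "be", "or", "not"]
lengths = dict(zip(words, map(len, words)))
```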

  8. Sorting Lists
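
A short sketch of the two standard ways to sort a list in Python 3 (the word list is arbitrary and not taken from the slide):

```python
words = ["rabbit", "Alice", "queen", "cat"]

# sorted() returns a new list; uppercase letters sort before lowercase ones.
alphabetical = sorted(words)

# A key function changes what is compared, here the lowercased word.
ignoring_case = sorted(words, key=str.lower)

# list.sort() sorts in place and returns None.
words.sort(reverse=True)
```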

  9. Sorting Lists by Two Criteria • First priority: length • Second priority: lexicographical order
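
One way to express the two criteria, assuming a key function is acceptable, is to return a tuple: tuples compare item by item, so length decides first and the word itself breaks ties. The word list is made up.

```python
words = ["pear", "fig", "apple", "kiwi", "date", "plum"]

# Key is (length, word): shorter words first, ties broken lexicographically.
by_length_then_alpha = sorted(words, key=lambda w: (len(w), w))
```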

  10. Today: Data Analysis • Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. • Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains. • Descriptive • Predictive

  11. Data Analysis Examples • Google • Stock market trends • Genome-disease association • Face recognition • Production yield • Business intelligence • Speech recognition • Text categorization

  12. Text Categorization / Document Classification

  13. How is it Done? • Manually • Automatically • Gather document statistics • Measure how similar a document is to the documents in each category • Today we will collect word statistics from several well-known books
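
The slide does not say which similarity measure is meant. As an assumption for illustration only, the sketch below uses cosine similarity between word-count vectors, which is one common choice; `cosine_similarity` is a hypothetical helper and the sample texts are made up.

```python
import math
from collections import Counter

def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc = Counter("the queen said off with his head".split())
same = cosine_similarity(doc, doc)  # identical documents score (about) 1
unrelated = cosine_similarity(doc, Counter("stock market trends".split()))  # no shared words
```

A document would then be assigned to the category whose documents it resembles most under the chosen measure.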

  14. Plan • Find data • Collect word statistics • Observe results

  15. Find Data • This might be the hardest task for many applications! • Project Gutenberg (http://www.gutenberg.org/) • Alice's Adventures in Wonderland (http://www.gutenberg.org/cache/epub/11/pg11.txt) • The Bible, King James version, Book 1: Genesis (http://www.gutenberg.org/cache/epub/8001/pg8001.txt)

  16. Reading a Book
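
The reading code itself is not in the transcript. A minimal sketch, assuming the book has been saved locally as a plain-text file (the name `pg11.txt` is taken from the Project Gutenberg URL above), could split the file into words line by line. `read_words` is a hypothetical helper, and an in-memory string stands in for the file so the example runs on its own.

```python
import io

def read_words(book):
    """Return all whitespace-separated words from an open text file, in order."""
    words = []
    for line in book:
        words.extend(line.split())
    return words

# Stand-in for: open("pg11.txt", encoding="utf-8")
sample = io.StringIO("Alice was beginning\nto get very tired\n")
words = read_words(sample)
```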

  17. Flying

  18. Print Most Popular Words (High Level)

  19. Modular Programming • Top-down approach: first write what you plan to do and then implement the details • Clear for readers • Easy to debug and test • Easy to maintain

  20. PrintMostPopular: Build Word-Occurrences Dictionary
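
The slide's code is not included here. A plausible sketch of building the word-occurrences dictionary, with `count_words` as a hypothetical helper name, is:

```python
def count_words(words):
    """Map each word to the number of times it occurs in the list."""
    counts = {}
    for word in words:
        # get() returns 0 the first time a word is seen.
        counts[word] = counts.get(word, 0) + 1
    return counts

counts = count_words("the cat and the hat and the bat".split())
```

The standard library's `collections.Counter` does the same job; the explicit loop is shown because it uses only the dictionary operations covered earlier.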

  21. PrintMostPopular: Sort Words by Occurrences • See the operator module: http://docs.python.org/library/operator.html
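
Given the link to the operator module, the sorting step presumably relies on `operator.itemgetter`; that is an inference, since the slide's code is missing. A sketch with made-up counts:

```python
from operator import itemgetter

counts = {"the": 3, "and": 2, "cat": 1, "hat": 1}  # made-up counts

# items() yields (word, count) pairs; itemgetter(1) selects the count as the sort key.
by_count = sorted(counts.items(), key=itemgetter(1), reverse=True)

top_two = by_count[:2]  # the two most frequent words
```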

  22. The Code

  23. Results

  24. And Now for Several Books

  25. Results • The word “to” as an example: Bible vs. L. Carroll

  26. How is it Really Done? • Preprocessing (e.g., convert words to lower case, remove punctuation marks) • Word count • Enhance statistics • Discard stop words (e.g., and, of, a) • Stemming (e.g., go & went) • Synonyms • Bigrams, trigrams • Similarity measures to existing documents / categories
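
Some of the steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the lecture's pipeline: the stop-word set is a tiny made-up sample (the course links a full list separately), `preprocess` and `bigrams` are hypothetical helpers, and stemming and synonyms are left out because they need language resources beyond the standard library.

```python
import string

STOP_WORDS = {"and", "of", "a", "the", "to"}  # tiny illustrative sample

def preprocess(text):
    """Lowercase, strip ASCII punctuation, split into words, drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return [w for w in words if w not in STOP_WORDS]

def bigrams(words):
    """Return consecutive word pairs."""
    return list(zip(words, words[1:]))

# Made-up sentence for demonstration.
tokens = preprocess("Off with his head! said the Queen, and a hush fell.")
pairs = bigrams(tokens)
```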

  27. How is it Really Done? Categories • Topics: http://www.cs.tau.ac.il/courses/pyProg/1112a/lectures/4/topics.rbb • Categories hierarchy: http://www.cs.tau.ac.il/courses/pyProg/1112a/lectures/4/rcv1.topics.hier.orig

  28. How is it Really Done? Enhance Statistics • Stop words: http://www.cs.tau.ac.il/courses/pyProg/1112a/lectures/4/english.stop • After processing: http://www.cs.tau.ac.il/courses/pyProg/1112a/lectures/4/lyrl2004-non-v2_tokens_test_pt0.dat
