Session 1 • Wharton Summer Tech Camp
AGENDA 1. Intro 2. Unix Usage (Hugh MacMullan)
Wharton Tech Camp: Basic Info • Not a required course; for enrichment. • You reap what you sow. • No required homework except installing the required software (however, there will be resources and exercises). • 8/4~8/20: MON, WED, FRI 1:30pm-3pm • Format: 60 min presentation + 30 min lab time. • BYOL: Bring your own laptop. • Class website: opim.wharton.upenn.edu/techcamp/ • Please register and complete the short survey if you haven’t already. • The syllabus is up. • Links and resources for learning
Thanks to • Wharton Computing: Hugh MacMullan (Director of Research Computing) and Alec Lamon (Senior Director, Wharton Computing) • Prof Eric Bradlow and Prof Noah Gans • Mallory Hiatt, Maggie Saia (Wharton Doctoral) • And all the Ph.D. coordinators for email blasts • Please give feedback and suggestions during the course.
Tech Camp Goal • Target audience: beginning (or any) Wharton PhD students, or anyone who wants an overview of useful tools. • What: Explore both public and Wharton-specific cutting-edge tools and resources available for empirical research in business. • Tools current doctoral students have asked about or are using. • Concerned with acquiring, cleaning, storing, and managing data. • Several syllabus revisions • Usage: Look at the syllabus and come if you think a particular session is interesting. People’s backgrounds in tech and tools are diverse, so it is hard to please everybody.
What you guys want out of the course • Better awareness of existing data analysis tools • Tools for data scraping and coding • Overview of tools to develop skills in • Data analysis framework, not a big focus on details • Learn about resources available • Overview of programs and what each would be used for, and some intro hands-on examples would be great. • Text mining, Big Data Analytics • Requests on specific tools/languages: MATLAB, network mapping, SAS • Etc.
Quick overview of topics • Getting familiar with the Unix environment • A great research-oriented language: Python • Regular expressions: acquiring and cleaning data • Data acquisition (Web, Company, Wharton, etc.) • APIs and scraping the Web (manually & using Scrapy) • HPCC (Grid): guest lecturer (Hugh MacMullan) • Big Data (Variety, Volume, Veracity, Velocity), Data Mining, and Empirical Business Research • R tips and tricks: Dr. Ari Friedman • Practical Natural Language Processing
Talks Planned • Hugh (Wharton Research Computing) will talk about the grid on 8/11. • Ari Friedman, a Health Management PhD / MD student, will give a talk on R: a bit about software architecture, RStudio, and dplyr.
What this camp is not • You will not learn how to program. This takes a long time and can only be done by yourself, with a computer, Google (or any other reference), and example code. • This is not a course in programming. • Courses that heavily use R: almost all non-theory doctoral stat classes, in particular STAT 541 (Buja), STAT 542 (Jensen), and observational studies, STAT 921 (Small). Take Econ econometrics classes for theory and Stat econometrics classes to actually apply it (STAT 520, 521). • SAS: look at Paul Allison’s classes and a marketing class by Raghu Iyengar. • MATLAB: structural modeling class in Marketing and BEPP (Ulrich Doraszelski’s class). Also CIS 520: Intro to Machine Learning. • Python: CIS 530: Intro to Computational Linguistics.
A random tip before we start • As you know, within the university’s network you can access many journals. You can do the same from anywhere in the world through the library proxy. • Read more here: http://www.library.upenn.edu/proxy/ • Or put the following before any journal URL and log in: https://proxy.library.upenn.edu/login?url=
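The proxy tip amounts to simple string concatenation, so it is easy to script. The sketch below is a minimal illustration: the helper name `via_penn_proxy` and the example journal URL are hypothetical, and only the proxy prefix comes from the slide.

```python
# Proxy login prefix, copied from the slide.
PROXY_PREFIX = "https://proxy.library.upenn.edu/login?url="


def via_penn_proxy(journal_url):
    """Return journal_url rewritten to go through the library proxy login.

    URLs that already start with the proxy prefix are returned unchanged.
    """
    if journal_url.startswith(PROXY_PREFIX):
        return journal_url
    return PROXY_PREFIX + journal_url


# Hypothetical journal URL, used only for illustration.
print(via_penn_proxy("https://www.example.com/journal/article123"))
```

Opening the printed URL in a browser should lead to the library login page before redirecting to the article, assuming the journal is covered by the library's subscriptions.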
There are right tools for different jobs • Borrowing Emerson’s famous quote: asserting the supremacy of one tool is a hobgoblin of little minds. Different tools have comparative advantages, whether in convenience-power tradeoffs, support-power tradeoffs, or elsewhere. Be open-minded and try different tools. • “There is no silver bullet [software]” • Master one statistical tool and one programming tool (e.g. R + Python), then spend some time learning other tools.
There are right tools for different jobs • Then spend a few days briefly learning many other tools; the 80/20 rule applies.
Case in Point - Statistical Software • Once you put time into learning two or more, the benefit of using multiple tools for a big project becomes obvious.
Programming Language Hierarchy • [Figure: programming language hierarchy, taken from http://trycatch22.com]
Programming Language Hierarchy • Moving toward higher-level languages: generally gets easier/faster to code (there are exceptions) • Becomes less flexible in exchange for convenience • Moving toward lower-level languages: performance (speed) increases (there are exceptions)
Then there are languages written as jokes, if you are bored • These are some Turing-complete languages (a CS term for an instruction set or programming language that can, in principle, compute anything any other programming language can) • COW • LOLCODE • Brain#$%@ • Slide examples: one program generates the Fibonacci sequence, another prints “Hello world”
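For contrast with the joke languages, here is a rough sketch of the same two tasks (printing “Hello world” and generating the Fibonacci sequence) in Python. The function name `fibonacci` and the choice to start the sequence at 0, 1 are assumptions made for illustration; the slide does not say which convention its examples use.

```python
def fibonacci(n):
    """Return the first n Fibonacci numbers, starting from 0 and 1."""
    sequence = []
    a, b = 0, 1
    for _ in range(n):
        sequence.append(a)
        a, b = b, a + b  # advance the pair one step along the sequence
    return sequence


print("Hello world")
print(fibonacci(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

The point is readability rather than speed: a high-level language expresses both tasks in a few self-explanatory lines.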
Case in Point – NBER PAPER http://www.nber.org/papers/w20263 • They solve the stochastic neoclassical growth model using many different languages. They implement the same algorithm, value function iteration with grid search, in each of the languages. • C++ and Fortran are still the fastest. • C++ has edged out Fortran thanks to compiler advances. • Matlab is between 9 and 11 times slower than the best C++ executable. • R runs between 475 and 491 times slower than C++.
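To give a feel for what "value function iteration with grid search" means, here is a minimal sketch in plain Python. It is not the paper's code and not the paper's model: it assumes a deterministic version of the growth model with log utility and full depreciation, and the function name and all parameter values (`alpha`, `beta`, grid size, tolerance) are illustrative assumptions. That simplified case is convenient because its optimal policy is known in closed form, k' = alpha * beta * k^alpha, so the numerical answer can be checked.

```python
import math


def solve_growth_model(alpha=0.3, beta=0.9, n_grid=60, tol=1e-6, max_iter=1000):
    """Value function iteration with grid search (simplified, deterministic model).

    Assumptions (not from the paper): log utility, full depreciation,
    no productivity shocks. Returns (grid, value, policy), where policy[i]
    is the grid index of next-period capital chosen at grid[i].
    """
    # Steady-state capital of the simplified model; the grid is centered on it.
    k_ss = (alpha * beta) ** (1.0 / (1.0 - alpha))
    grid = [k_ss * (0.5 + i / float(n_grid - 1)) for i in range(n_grid)]
    output = [k ** alpha for k in grid]

    value = [0.0] * n_grid
    policy = [0] * n_grid
    for _ in range(max_iter):
        new_value = []
        for i in range(n_grid):
            best, best_j = float("-inf"), 0
            # Grid search: try every candidate for next-period capital.
            for j in range(n_grid):
                consumption = output[i] - grid[j]
                if consumption <= 0:
                    break  # grid is increasing, so larger choices are infeasible too
                candidate = math.log(consumption) + beta * value[j]
                if candidate > best:
                    best, best_j = candidate, j
            new_value.append(best)
            policy[i] = best_j
        # Stop once the value function has (numerically) stopped changing.
        diff = max(abs(a - b) for a, b in zip(new_value, value))
        value = new_value
        if diff < tol:
            break
    return grid, value, policy


grid, value, policy = solve_growth_model()
# Compare the numerical policy with the closed-form one for this special case.
gap = max(abs(grid[j] - 0.3 * 0.9 * k ** 0.3) for k, j in zip(grid, policy))
print("largest gap to the closed-form policy: %.5f" % gap)
```

The three nested loops (iterations, current capital, candidate next-period capital) are why interpreted languages fare poorly on this benchmark: the work is almost entirely tight scalar loops, which compiled languages such as C++ and Fortran handle far more efficiently.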
Software & Accounts • We will be using software and tools such as Python throughout the course. • You must install and configure them on your own laptop. This can take a very long time due to compatibility, compiler versions, etc. • Wharton Research Computing is here to help • research-computing@wharton.upenn.edu
Immediately apply for • Wharton Grid Account: https://unix.wharton.upenn.edu/ • WRDS Account: https://wrds-web.wharton.upenn.edu/wrds/
Install and configure • Python: get 2.X. Simplest: get the Enthought Canopy packaged version of Python (https://www.enthought.com/downloads/) • Terminal emulator (for Windows users): PuTTY, Cygwin, etc. • Choose your favorite text editor or IDE: Emacs, Vim, TextMate, Notepad++, Eclipse, etc. • Future software/package requirements: NLTK package for Python, R, RStudio
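Once Python is installed, a quick sanity check can confirm which version is running and whether a required package is importable. This is a minimal sketch: the helper name `has_package` is hypothetical, and the script is written so that it should run under either Python 2.X or Python 3.

```python
import sys


def has_package(name):
    """Return True if the named package can be imported, False otherwise."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


print("Python version: %d.%d" % sys.version_info[:2])
# NLTK is the package listed under future requirements on the slide.
for package in ("nltk",):
    status = "installed" if has_package(package) else "missing"
    print("%s: %s" % (package, status))
```

Save it as a file (for example `check_setup.py`, a hypothetical name) and run it from the terminal. A "missing" result means the package still needs to be installed for the Python interpreter that ran the script.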
If you have questions • Feel free to ask me; I’ll point you in the right direction. • The Google & Stackoverflow.com combo is the ultimate weapon. • Before posting on Stack Overflow, make sure to do your research or you’ll get “RTFM + LMGTFY”. • All computational tools are man-made, so someone somewhere has written documentation or asked the same question. It is rare to find no answer online; if that happens, ask on forums. • Wharton Computing: best for Wharton-specific questions. Before you email them, check out the wiki first (https://wiki.wharton.upenn.edu/researchcomputing/).
Intro to Unix • Hugh MacMullan, Director of Research Computing • Gavin Burris, Senior Project Leader for Research Computing