Other formats for data
Linked lists, Hash tables, JSON, Big Data, Hadoop & MapReduce. REST. Parallel processing exercise.
Homework: Plans for group sorting. Prepare for RSA talk. Postings.
Linked list
• Big array for data
• Array of arrays: think of rows
• Each row has information + one or more pointers to other rows. Various ways:
  • Forward pointing list: next item
  • Forward and back: next and previous item
  • Tree: first child item and next sibling
  • or first child, next sibling, parent
  • or first child, next sibling, parent1, parent2
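The forward-pointing list can be sketched as an array of rows. This is only an illustration: the row layout `[info, next]`, the use of -1 for "no next item", and the names `rows`, `head`, and `walk` are assumptions, not part of the slides.

```javascript
// Each row is [info, next], where next is the index of the following row.
// -1 means "no next item". The rows are deliberately not stored in list order.
const rows = [
  ["cherry", -1], // row 0: last item in the list
  ["apple", 2],   // row 1: first item, points to row 2
  ["banana", 0],  // row 2: points to row 0
];
const head = 1; // index of the first row in the list

// Follow the next pointers from the starting row and collect the info fields.
function walk(table, start) {
  const items = [];
  for (let i = start; i !== -1; i = table[i][1]) {
    items.push(table[i][0]);
  }
  return items;
}

console.log(walk(rows, head)); // apple, banana, cherry
```

The order of the list comes from the pointers, not from where the rows sit in the array, so inserting an item only means changing one or two pointers.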
Exercise
• Make your family tree
  • each row has a name, parent1 (optionally include a second parent), first child, next sibling
  • you need to start somewhere
• Put down "Not defined" for things not in the table.
• Put down -1 for cases of no children or no next sibling.
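One possible shape for such a table, as a rough sketch. The family and every name in it are invented, and the field names `parent1`, `firstChild`, and `nextSibling` are just one way to label the columns.

```javascript
// Hypothetical family table; all names are made up for the example.
// "Not defined" marks a parent who is not in the table; -1 marks no child / no next sibling.
const family = [
  { name: "Ann",  parent1: "Not defined", firstChild: 1,  nextSibling: -1 }, // row 0
  { name: "Bob",  parent1: 0,             firstChild: 3,  nextSibling: 2  }, // row 1
  { name: "Cara", parent1: 0,             firstChild: -1, nextSibling: -1 }, // row 2
  { name: "Dan",  parent1: 1,             firstChild: -1, nextSibling: -1 }, // row 3
];

// List the children of a row: start at firstChild, then follow nextSibling.
function childrenOf(table, row) {
  const names = [];
  for (let c = table[row].firstChild; c !== -1; c = table[c].nextSibling) {
    names.push(table[c].name);
  }
  return names;
}

console.log(childrenOf(family, 0)); // Bob, Cara
```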
Hash tables
• Problem: how to find elements in a table?
  • no intrinsic order. If there were, you could use binary search.
  • Binary search: compare the value (or the key) to the middle value. If it is less, search the lower half; if it is greater, search the upper half. Keep going until the value is found or nothing is left to search.
• Aside: Meyer family geography game
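Binary search as described above could look roughly like this. It assumes the array is already sorted in increasing order; the names `binarySearch` and `sortedPrices` are made up for the example.

```javascript
// Return the index of target in a sorted array, or -1 if it is not there.
function binarySearch(sorted, target) {
  let low = 0;
  let high = sorted.length - 1;
  while (low <= high) {
    const mid = Math.floor((low + high) / 2);
    if (sorted[mid] === target) return mid;
    if (target < sorted[mid]) {
      high = mid - 1; // search the lower half
    } else {
      low = mid + 1;  // search the upper half
    }
  }
  return -1; // nothing left to search
}

const sortedPrices = [20, 50, 100, 150];
console.log(binarySearch(sortedPrices, 100)); // 2
console.log(binarySearch(sortedPrices, 75));  // -1
```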
Hash table approach
• Have key-value pairs.
• Have the task of finding whether the current key is in the table.
• Assume there is a hash function that takes the key as input and outputs the hash, which corresponds to a slot in the table.
  • fixed time to compute the function
  • go to that slot. If it is empty, store the key-value pair there. If it is not empty, compare the keys: if they match, then …. If not, check the next position and continue.
• http://en.wikibooks.org/wiki/Data_Structures/Hash_Tables
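A toy version of this approach, as a sketch under several assumptions: the table size of 8, the character-sum hash function, and the choice to overwrite the value when the keys match are all made up here (the slide leaves that last step open). Deleting entries is not handled.

```javascript
const SIZE = 8; // hypothetical fixed table size
const slots = new Array(SIZE).fill(null);

// Toy hash function: add up the character codes, then wrap into the table.
function hash(key) {
  let sum = 0;
  for (const ch of key) sum += ch.charCodeAt(0);
  return sum % SIZE;
}

// Find the slot that holds key, or the first empty slot where it would go.
function findSlot(key) {
  let i = hash(key);
  for (let probes = 0; probes < SIZE; probes++) {
    if (slots[i] === null || slots[i].key === key) return i;
    i = (i + 1) % SIZE; // taken by a different key: check the next position
  }
  return -1; // every slot is taken by other keys
}

function put(key, value) {
  const i = findSlot(key);
  if (i === -1) throw new Error("table is full");
  slots[i] = { key, value }; // assumption: a matching key gets its value replaced
}

function get(key) {
  const i = findSlot(key);
  return i !== -1 && slots[i] !== null ? slots[i].value : undefined;
}

put("table", 100);
put("desk", 150);
put("chair", 50);
put("lamp", 20);
console.log(get("chair")); // 50
console.log(get("sofa"));  // undefined
```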
Associative array
• Normal arrays use numeric indices, typically starting with 0.
• An associative array uses keys, such as names, in place of numeric indices. Consider a set of 4 products: table, desk, chair, lamp. An associative array could be used to store the prices: table=>100, desk=>150, chair=>50, lamp=>20
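In JavaScript, a plain object can play the role of the associative array for the four products. A minimal sketch; the variable name `prices` and the new lamp price are made up.

```javascript
// Product names are the keys, prices are the values.
const prices = { table: 100, desk: 150, chair: 50, lamp: 20 };

console.log(prices["chair"]); // 50
console.log(prices.desk);     // 150

prices.lamp = 25; // modify the value stored under an existing key
console.log(Object.keys(prices)); // table, desk, chair, lamp
```

JavaScript also has a built-in `Map` type, which allows keys that are not strings.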
key-value pairs
• So-called key-value pairs are a generalization of the associative array and are used in other systems.
• At its most general, there can be more than one key-value pair for a given key, and the basic software OR your program needs to take care of this situation.
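One way a program might take care of several values for the same key is to keep a list per key. This is a sketch only: the helper name `addValue` and the sample data are assumptions.

```javascript
// Store every value seen for a key in an array under that key.
function addValue(map, key, value) {
  if (!map.has(key)) map.set(key, []);
  map.get(key).push(value);
}

const priceHistory = new Map();
addValue(priceHistory, "chair", 50);
addValue(priceHistory, "chair", 65); // second value for the same key
addValue(priceHistory, "lamp", 20);

console.log(priceHistory.get("chair")); // 50, 65
```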
JSON
• http://www.json.org/
• Format (syntax) for information
• typically smaller than the equivalent XML
• available in many languages
• name / value pairs (objects)
  • create using curly braces. Use dot notation to access and modify.
• arrays
  • create using square brackets. Use square brackets with indices to access and modify.
Example
var course = {"name": "Topics", "teacher": "Jeanine Meyer", "days": "MR"};
course.name => "Topics"
course.teacher => "Jeanine Meyer"
course.days => "MR"
Example
var list = { "class_list": [
  {"firstname": "Groucho", "lastname": "Marx"},
  {"firstname": "Harpo", "lastname": "Marx"},
  {"firstname": "Zeppo", "lastname": "Marx"},
  {"firstname": "Curly", "lastname": "Stooge"}
]};
list.class_list[2].firstname => "Zeppo"
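The examples above are JavaScript objects written with JSON-style syntax. To send such data to another program it is turned into text and back, which in JavaScript is done with the built-in `JSON.stringify` and `JSON.parse`. A short sketch; the changed `days` value is made up.

```javascript
const course = { name: "Topics", teacher: "Jeanine Meyer", days: "MR" };

// Object -> JSON text (a string that can be stored or sent over the network).
const text = JSON.stringify(course);
console.log(text); // {"name":"Topics","teacher":"Jeanine Meyer","days":"MR"}

// JSON text -> a new, independent object.
const copy = JSON.parse(text);
copy.days = "TF"; // hypothetical change; the original object is not affected
console.log(course.days); // MR
```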
Big Data
• buzz word more than a specific product
• Data that is
  • large in size: Volume
  • changing rapidly [or the application requires up-to-date values]: Velocity
  • in different formats: Variety
• PLUS not necessarily all owned by the organization attempting to use it.
  • in this case, can only query: no changes/updates, deletions or additions
Note
• A company / organization can store data in its own CLOUD (on its own servers) or in a cloud service offered by a vendor and still have total control.
• Could even be a relational database
• Very large databases may be just key-value pairs
Cloud
… can refer to one, some or all of the following
• where the programs are
• where the data is
• where the processors (aka computers) are for doing the calculations
REST
• Representational State Transfer
• a "standard" / framework / style of communicating with Web services
• typically, get information in the form of XML or JSON or something else
• Posting opportunity: find a specific service that provides REST connections….
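A rough sketch of what asking a REST service for JSON can look like. The host `api.example.com`, the `products` resource, and the function names are placeholders, not a real service; `fetch` is assumed to be available (browsers and recent versions of Node.js).

```javascript
// Hypothetical service: api.example.com is a placeholder, not a real endpoint.
const BASE = "https://api.example.com/";

// Build the URL for a resource plus query parameters.
function buildRequestUrl(resource, params) {
  const url = new URL(resource, BASE);
  for (const [name, value] of Object.entries(params)) {
    url.searchParams.set(name, value);
  }
  return url.toString();
}

// Send a GET request and parse the JSON in the response.
async function getJson(resource, params) {
  const response = await fetch(buildRequestUrl(resource, params));
  if (!response.ok) throw new Error("HTTP " + response.status);
  return response.json();
}

console.log(buildRequestUrl("products", { name: "chair" }));
// https://api.example.com/products?name=chair
```

With a real service, `getJson("products", { name: "chair" })` would return a promise for the parsed data.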
Parallel processing / distributed processing
• Large amounts (volumes) of data
• Multiple processors
• How to speed up accomplishment of tasks?
• Embarrassingly parallel refers to tasks that are easy to parallelize
  • Take a list of numbers (say, prices) and increase each by 10%
  • ?
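The price example is embarrassingly parallel because each price can be raised without looking at any other price. The sketch below only simulates this in ordinary, single-threaded JavaScript: the list is split into chunks, each chunk is handled on its own, and the results are joined. Prices are kept in cents and rounded, which is an assumption made to avoid fractions of a cent; all names are invented.

```javascript
// Split a list into at most `parts` chunks of nearly equal size.
function splitIntoChunks(list, parts) {
  const size = Math.ceil(list.length / parts);
  const chunks = [];
  for (let i = 0; i < list.length; i += size) {
    chunks.push(list.slice(i, i + size));
  }
  return chunks;
}

// The work for one item: raise a price (in cents) by 10%.
const raise = (cents) => Math.round(cents * 1.1);

// Each chunk could go to a different processor; no chunk needs data from another.
function raiseAll(pricesInCents, processors) {
  const chunks = splitIntoChunks(pricesInCents, processors);
  const results = chunks.map((chunk) => chunk.map(raise)); // independent per chunk
  return results.flat(); // put the pieces back together in order
}

console.log(raiseAll([10000, 15000, 5000, 2000], 2)); // 11000, 16500, 5500, 2200
```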
What about
• Tasks in which some parts can be done in parallel, but some cannot
• How to devise ways to take advantage of multiple processors
Parallel exercise
• Divide into groups of 5
• Each take a deck of cards
• Shuffle
• Devise a plan to sort into order
  • suits: hearts, spades, diamonds, clubs
  • each suit: A, 2, …, J, Q, K
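One possible plan, simulated in code as a starting point rather than a model answer: one person deals the deck into four piles by suit, four people each sort one pile, and the piles are stacked in suit order. The function names are invented, and the "parallel" step runs one pile after another here.

```javascript
const SUITS = ["hearts", "spades", "diamonds", "clubs"];
const RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"];

// A full deck in the target order.
function makeDeck() {
  return SUITS.flatMap((suit) => RANKS.map((rank) => ({ suit, rank })));
}

// Fisher-Yates shuffle on a copy of the deck.
function shuffle(deck) {
  const cards = [...deck];
  for (let i = cards.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [cards[i], cards[j]] = [cards[j], cards[i]];
  }
  return cards;
}

function sortDeck(deck) {
  // Step 1 (one person): deal the cards into four piles, one per suit.
  const piles = SUITS.map((suit) => deck.filter((card) => card.suit === suit));
  // Step 2 (four people, each with one pile): sort the pile by rank.
  const sortedPiles = piles.map((pile) =>
    [...pile].sort((a, b) => RANKS.indexOf(a.rank) - RANKS.indexOf(b.rank))
  );
  // Step 3: stack the piles in suit order.
  return sortedPiles.flat();
}

const sorted = sortDeck(shuffle(makeDeck()));
console.log(sorted[0], sorted[51]); // ace of hearts first, king of clubs last
```

Step 1 is done by one person while the others wait, which is one place to look for improvements.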
Hadoop
• open source utilities for distributed computing
• http://hadoop.apache.org/
• Includes MapReduce
MapReduce
A MapReduce job
• map sets up tasks to be done in parallel
• reduce combines the results
• there may be a local combine step and then a reduce across the outputs of all those steps
• Requires a file system
• Data is in key/value pairs
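The map, combine, and reduce steps can be imitated in a few lines of plain JavaScript. This is not the Hadoop API, only a sketch of the idea: the input, the chunking, and the names `mapRecord` and `sumByKey` are all made up, and everything runs on one machine.

```javascript
// Hypothetical input: product names from sales records, already split into
// two chunks, one per worker.
const chunks = [
  ["chair", "lamp", "chair"],
  ["lamp", "desk", "chair"],
];

// map: turn each record into a key/value pair.
function mapRecord(record) {
  return [record, 1];
}

// combine / reduce: add up the values that share a key.
function sumByKey(pairs) {
  const totals = new Map();
  for (const [key, value] of pairs) {
    totals.set(key, (totals.get(key) || 0) + value);
  }
  return [...totals.entries()];
}

// Each worker maps its own chunk and does a local combine.
const combined = chunks.map((chunk) => sumByKey(chunk.map(mapRecord)));

// Final reduce across the outputs of all workers.
const result = sumByKey(combined.flat());
console.log(result); // chair 3, lamp 2, desk 1
```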
Applications
• What are applications that use multiple processors for a [big] gain in speed?
Homework
• Come up with improved parallel sorting
• Postings: more on Hadoop, MapReduce, Big Data, etc.