Searching for Success Amazon CloudSearch and Relational Databases
Agenda • Finding things • Types of Databases • Making Choices • What is CloudSearch? • Combining CloudSearch with Relational • Sample Code
Finding Things So Many Databases
Finding Your Information • Your users need to find things • What do you use? • A Database! • What Kind?
It's a Big World Out There! • "Database" != "Relational Database" • Tons of relational databases • Amazon RDS • MySQL • MSSQL • Oracle • but…
Many Other Types • NoSQL databases • Dynamo, Cassandra, CouchDB… • Graph databases • Neo4j, Titan… • Column-oriented databases • Redshift, Bigtable… • Text search engines • CloudSearch, Lucene, Autonomy…
Text Search Engine • Good at text queries • "Harry Potter and the Philosopher's Stone" [Diagram: each word of the title is shown as written, then lowercased, then normalized ("Philosopher's" becomes "philosopher"); "and" and "the" are dropped. Resulting terms: harry potter philosopher stone]
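The word processing on this slide can be sketched in a few lines of Java. This is a toy illustration, not CloudSearch's actual analysis chain: the class name `TitleAnalyzer`, the four-word stopword list, and the "strip a trailing 's" rule are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TitleAnalyzer {
    // Hypothetical stopword list; real engines use longer, configurable lists.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("and", "the", "of", "a"));

    // Turns raw text into the list of terms that would be indexed.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Split on anything that is not a letter, a digit, or an apostrophe.
        for (String raw : text.split("[^\\p{L}\\p{Nd}']+")) {
            String term = raw.toLowerCase(Locale.ROOT);
            // Toy normalization: strip a possessive "'s" and nothing else.
            if (term.endsWith("'s")) {
                term = term.substring(0, term.length() - 2);
            }
            if (term.isEmpty() || STOPWORDS.contains(term)) {
                continue;
            }
            terms.add(term);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints [harry, potter, philosopher, stone]
        System.out.println(analyze("Harry Potter and the Philosopher's Stone"));
    }
}
```

The same analysis has to run on the query text as well, so that a search for "Potter" and an indexed "potter" end up as the same term.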
Text Search Engine • Basic element is the document • Documents are made of fields • "title" => "star wars" • Fields can be • Missing • Multi-valued • Variable length
Text Search Engine • Documents are not "normalized" • In a relational database • A movie table • A director table • An actor table • In CloudSearch • One document per movie
Text Search Engine Relational
Relevance • Key differentiator for text search • Not "does this match?" • "How WELL does this match?" • Includes multiple factors • Term Frequency, Document Frequency, Proximity • Users can customize this • Distance • Popularity • Field Weighting
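To make "how well does this match?" concrete, here is a minimal sketch of a score built from two of the factors named above, term frequency and document frequency. It is a textbook TF-IDF variant, not the formula CloudSearch uses; proximity and the custom factors (distance, popularity, field weighting) are left out, and the class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.List;

class TfIdfSketch {
    // Term frequency: how often the term occurs in one document.
    static long termFrequency(List<String> doc, String term) {
        return doc.stream().filter(term::equals).count();
    }

    // Smoothed inverse document frequency: terms found in fewer documents weigh more.
    static double inverseDocumentFrequency(List<List<String>> corpus, String term) {
        long containing = corpus.stream().filter(d -> d.contains(term)).count();
        return Math.log((1.0 + corpus.size()) / (1.0 + containing)) + 1.0;
    }

    // Score of one document for a query: sum of tf * idf over the query terms.
    static double score(List<List<String>> corpus, List<String> doc, List<String> query) {
        double total = 0.0;
        for (String term : query) {
            total += termFrequency(doc, term) * inverseDocumentFrequency(corpus, term);
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("star", "wars"),
                Arrays.asList("star", "trek"),
                Arrays.asList("war", "and", "peace"));
        List<String> query = Arrays.asList("star", "trek");
        for (List<String> doc : corpus) {
            System.out.println(doc + " -> " + score(corpus, doc, query));
        }
    }
}
```

For the query "star trek", the document containing both words scores highest, the one containing only "star" scores lower, and the one containing neither scores zero. A relational `WHERE` clause would only tell you which rows match, not in what order to show them.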
Text is more than "War & Peace" • It's not just books & blog posts • Meta-data • Author, Title, Category, Tags • Can include numbers: counts, dates, latitude,…
Making Choices Relational? CloudSearch?
Relational Database • Good at • Exact matches • Joins • Atomic Transactions • Not so good at • Relevance • How well does this match? • Handling words
Text Search Engines • Good at finding • Words, Phrases • Relevance • Not so good at • Joins • Transactions
Options for Search • Can I just use a relational database? • Yes. • Do I want to just use a relational database? • Probably not
Simple Approach • Widely supported, easy:
    SELECT id, title FROM books WHERE title LIKE '%amazon%'
• Does not perform well • Doesn't deal with multiple words
Text Extensions for Relational Databases • Vendor specific (MySQL shown):
    SELECT id, title FROM books WHERE MATCH(title) AGAINST('Harry Potter' IN NATURAL LANGUAGE MODE)
• Use different index structures • Typically MUCH less mature than relational code • More manual processes • Scaling (if possible) • Managing • Minimal relevance, no control
Options • Relational database • Weak relevance • Scaling & performance limits • Text Search Engine • No transactions & locking • No Joins • Both • Some extra effort, then best of both worlds
CloudSearch • Fully-managed text search engine • High Performance • Automatically Scaling • Reliable, Resilient • Based on Amazon Product Search
Search Features • Faceting • Complex queries • (and 'potter harry' (not author:'rowling')) • Configurable synonyms, stemming & stopwords • Custom Sorting/Ranking
Scaling • CloudSearch scales automatically • Handle your spikes • Plan for success, but don't spend until you need it • Handle more data • Scaling is seamless – no downtime
Automatic Scaling [Diagram: a grid of search instances. DATA (document quantity and size) adds index partitions 1 to n; TRAFFIC (search request volume and complexity) adds copies 1 to n of each partition. Each search instance holds one copy of one index partition.]
Easy to Use Queries • REST API • Simple to add • HTTP POST • Simple to query • q=star trek • Simple to integrate • JSON [Diagram: documents and CloudSearch connected over HTTP]
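A query such as `q=star trek` has to be URL-encoded before it goes on the wire. The sketch below only builds the request URL. The endpoint host, the API version string, and the `/search` path are placeholders supplied by the caller, not values taken from the slides, so check them against your own search domain and the API version it uses.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SearchUrlSketch {
    // endpoint and apiVersion are assumptions supplied by the caller,
    // e.g. the search endpoint shown for your domain in the console.
    static String buildSearchUrl(String endpoint, String apiVersion, String query) {
        try {
            // URLEncoder turns spaces into '+' and escapes reserved characters.
            String encoded = URLEncoder.encode(query, "UTF-8");
            return "http://" + endpoint + "/" + apiVersion + "/search?q=" + encoded;
        } catch (UnsupportedEncodingException e) {
            // Cannot happen: every Java platform supports UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "search-example.invalid" is a placeholder host, not a real domain.
        System.out.println(buildSearchUrl("search-example.invalid", "2011-02-01", "star trek"));
    }
}
```

The resulting URL can be fetched with any HTTP client, and the JSON response parsed with the JSON library you already use.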
Amazon CloudSearch Architecture [Diagram: a search domain behind DNS / load balancing, made up of three services: CONFIG SERVICE (config API, command line tools, console; AWS Query), DOCUMENT SERVICE (doc svc API, command line tools, console), and SEARCH SERVICE (search API, console)]
What Can You Search For With CloudSearch? • Wine • Your college buddies • Curly hair products • Downton Abbey episodes • News in Bermuda • Playoff tickets • Online courses • Cat memes • Furniture • Doctor reviews • Take out food • Vacation rentals • Trademarks • African safaris • Kids arts & crafts • French dating/marriage • Online videos • Recipes • Weather insurance • Fashion news • Bollywood music • Stock art And more!
Combining the Two • Best of both worlds • Relational queries run on relational database • Text queries run on CloudSearch • Downside: Complexity • More moving parts • Synchronization
Synchronization • Which one is the master? • Usually the relational database • Updates • All at once • At regular intervals • When data is available • Deletes
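One way to think about updates and deletes when the relational database is the master: on each sync run, work out which ids changed since the last run and which ids are in the index but no longer in the database. The sketch below assumes the loader keeps its own record of the ids it has sent to the index and that each row carries a last-modified timestamp. Neither assumption comes from the slides, and the names are invented for the example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SyncPlanSketch {
    // Ids that were indexed earlier but no longer exist in the master database.
    static Set<String> idsToDelete(Set<String> indexedIds, Set<String> masterIds) {
        Set<String> stale = new TreeSet<>(indexedIds);
        stale.removeAll(masterIds);
        return stale;
    }

    // Ids whose rows were modified after the previous sync run.
    static Set<String> idsToUpdate(Map<String, Long> lastModifiedById, long lastSyncEpochSeconds) {
        Set<String> changed = new TreeSet<>();
        for (Map.Entry<String, Long> entry : lastModifiedById.entrySet()) {
            if (entry.getValue() > lastSyncEpochSeconds) {
                changed.add(entry.getKey());
            }
        }
        return changed;
    }
}
```

Running this at regular intervals gives the "at regular intervals" option; the "all at once" option skips the comparison and reloads every row.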
Dataflow • One source • Simultaneous updates CloudSearch Loader Source RDBMS
Dataflow • One source • Two loaders CloudSearch RDBMS Loader Loader Source
Dataflow • One source • Log updates • Two loaders CloudSearch RDBMS Loader Source Log Loader
Dataflow [Diagram: several sources write to one log; two loaders read the log, one feeding CloudSearch and one feeding the RDBMS]
Java Example • Read from MySQL • JDBC – Nothing special • Post to CloudSearch • Apache HTTP Client
Libraries • Apache • HTTP Client • HTTP Core • Commons Logging • AWS Java SDK • MySQL connector
Source Files • CloudSearchRDS • Just does the setup for the demo • ExtractAndUpload • Does the main work • Batcher • Groups documents into batches • PosterHttp • Posts to CloudSearch
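The demo's Batcher source is not shown in the slides. As a rough idea of what "groups documents into batches" can mean, here is a hypothetical size-based batcher: it takes documents that are already serialized as JSON strings and packs them into JSON arrays that stay under a byte limit. The class name, the method names, and the limit are assumptions; use the batch size limit documented for your CloudSearch API version.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SizeBatcher {
    private final int maxBytes;
    private final List<String> completed = new ArrayList<>();
    private StringBuilder current = new StringBuilder();
    private int currentBytes = 0;

    SizeBatcher(int maxBytes) {
        this.maxBytes = maxBytes;
    }

    // Adds one serialized JSON document, starting a new batch if it would not fit.
    // A single document larger than maxBytes still gets a batch of its own.
    void add(String jsonDocument) {
        int docBytes = jsonDocument.getBytes(StandardCharsets.UTF_8).length;
        // +1 for the separating comma, +2 for the enclosing brackets.
        if (currentBytes > 0 && currentBytes + 1 + docBytes + 2 > maxBytes) {
            flush();
        }
        if (currentBytes > 0) {
            current.append(',');
            currentBytes++;
        }
        current.append(jsonDocument);
        currentBytes += docBytes;
    }

    // Closes the batch in progress, if it holds any documents.
    void flush() {
        if (currentBytes == 0) {
            return;
        }
        completed.add("[" + current + "]");
        current = new StringBuilder();
        currentBytes = 0;
    }

    // Returns all batches, including the one in progress.
    List<String> batches() {
        flush();
        return completed;
    }
}
```

Each string returned by `batches()` is one request body for the poster class to send. Counting bytes rather than characters matters because non-ASCII titles take more than one byte per character in UTF-8.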
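Main Loop
    // Excerpt: stmt, names, lastModified, and batcher are set up elsewhere in the demo.
    ResultSet rs = stmt.executeQuery("select * from movies");
    ResultSetMetaData meta = rs.getMetaData();
    // Collect the column names once; they become the document field names.
    for (int col = 1; col <= meta.getColumnCount(); col++) {
        names.add(meta.getColumnName(col));
    }
    while (rs.next()) {
        // The document version is the last-modified time in seconds.
        int version = (int) (lastModified.getTime() / 1000);
        // Build one JSON document per row, one field per column.
        JSONObject doc = new JSONObject();
        for (String name : names) {
            doc.put(name, rs.getString(name));
        }
        String id = rs.getString("id");
        if (batcher != null) {
            batcher.addDocument(doc, version, id);
        }
    }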
SQL • select * from movies; • select `key` as id, title as name from movies; (KEY is a reserved word in MySQL, so a column with that name must be quoted) • Denormalizing may require multiple queries
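As an illustration of the last bullet, the sketch below folds the results of two queries into one document per movie, with the actors as a multi-valued field. To keep it runnable without a database, the rows are passed in as plain string arrays. The table layout (a movie table and an actor table keyed by movie id) and all the names are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DenormalizeSketch {
    // One search document per movie; actors is a multi-valued field.
    static class MovieDoc {
        final String title;
        final List<String> actors = new ArrayList<>();

        MovieDoc(String title) {
            this.title = title;
        }
    }

    // movieRows: {id, title}, as from "select id, title from movies".
    // actorRows: {movie_id, name}, as from a second query on the actor table.
    static Map<String, MovieDoc> denormalize(List<String[]> movieRows, List<String[]> actorRows) {
        Map<String, MovieDoc> docs = new LinkedHashMap<>();
        for (String[] movie : movieRows) {
            docs.put(movie[0], new MovieDoc(movie[1]));
        }
        for (String[] actor : actorRows) {
            MovieDoc doc = docs.get(actor[0]);
            if (doc != null) { // skip actors whose movie was not selected
                doc.actors.add(actor[1]);
            }
        }
        return docs;
    }
}
```

A single SQL join would also work, but it repeats the movie columns once per actor; two queries merged in memory keep each result set small.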
Search: It's not just for Relational Data • You can pull data from • S3 • Redshift • Web • Internal Documents • And more… • And make it searchable
Indexing S3
    // Excerpt: s3client, bucketName, and processObject are defined elsewhere.
    ListObjectsRequest listObjectsRequest = new ListObjectsRequest().withBucketName(bucketName);
    ObjectListing objectListing;
    do {
        // Each call returns one page of object summaries.
        objectListing = s3client.listObjects(listObjectsRequest);
        for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
            processObject(objectSummary);
        }
        // Continue from where the previous page ended.
        listObjectsRequest.setMarker(objectListing.getNextMarker());
    } while (objectListing.isTruncated());
Summary • Use the right tool! • Text Search for Searching Text • CloudSearch is fully managed text search • Easy to get data from relational DB • Easy to load data into CloudSearch
Next Step: Free Trial • One month (750 hours) free. • Set up an account • Give it a try! • Questions? • TomHill@amazon.com