
Solr 3.1 and Beyond

Solr 3.1 and Beyond. Yonik Seeley Lucid Imagination yonik@lucidimagination.com October 8, 2010. Agenda. Goal : Introduce new features you can try & use now in Solr development versions 3.1 or 4.0 Relevancy (Extended Dismax Parser) Spatial/Geo Search


Presentation Transcript


  1. Solr 3.1 and Beyond Yonik Seeley Lucid Imagination yonik@lucidimagination.com October 8, 2010

  2. Agenda
     Goal: Introduce new features you can try and use now in the Solr development versions (3.1 or 4.0)
     • Relevancy (Extended Dismax Parser)
     • Spatial/Geo Search
     • Search Result Grouping / Field Collapsing
     • Faceting (Pivot, Range, Per-segment)
     • Scalability (SolrCloud)
     • Odds & Ends
     • Q&A

  3. Solr 3.1? What happened to 1.5?
     • Lucene/Solr merged (March 2010)
       • Single set of committers
       • Single dev mailing list (dev@lucene.apache.org)
       • Single shared subversion trunk
       • Keep separate downloads, user mailing lists
       • Other former lucene subprojects spun off (Nutch, Tika, Mahout, etc)
     • Development
       • trunk is now always the next major release (currently 4.0)
       • branch_3x will be the base for all 3.x releases
       • Branch together, Release together, Share version numbers

  4. Relevance

  5. Extended Dismax Parser
     • Superset of dismax: &defType=edismax&q=foo&qf=body
     • Fixes edge cases where dismax could still throw exceptions, e.g. on input containing: OR AND NOT - "
     • Full lucene syntax support
       • Tries lucene syntax first
       • Smart escaping is done if there are syntax errors
     • Optionally supports treating "and"/"or" as AND/OR in lucene syntax
     • Fielded queries (e.g. myfield:foo) even in degraded mode
     • uf parameter controls which field names may be directly specified in "q"
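As a rough illustration of the request shape on this slide, the Python sketch below assembles an edismax select URL with the standard library. The `build_edismax_url` helper is hypothetical, and the base URL is only the example host used elsewhere in this deck; `defType`, `q`, and `qf` are the parameters named on the slide.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Example host from later slides; adjust to your own Solr instance (assumption).
BASE_URL = "http://localhost:8983/solr/select"

def build_edismax_url(user_query, query_fields, base_url=BASE_URL):
    """Return a select URL that sends user_query through the edismax parser."""
    params = {
        "defType": "edismax",          # select the extended dismax parser
        "q": user_query,               # raw user input; lucene syntax is allowed
        "qf": " ".join(query_fields),  # fields searched for unfielded terms
    }
    return base_url + "?" + urlencode(params)

if __name__ == "__main__":
    print(build_edismax_url("foo OR title:bar", ["body"]))
```

`urlencode` takes care of escaping spaces and colons, so the user's query can be passed through unmodified.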

  6. Extended Dismax Parser (continued)
     • boost parameter for multiplicative boost-by-function
     • Pure negative query clauses
       Example: solr OR (-solr)
     • Enhanced term proximity boosting
       • pf2=myfield results in term bigrams in sloppy phrase queries
         myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
     • Enhanced stopword handling
       • Stopwords are omitted in the main query, but added in the optional proximity boosting part
         Example: q=solr is awesome & qf=myfield & pf2=myfield
         -> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
       • Currently controlled by the absence of StopWordFilter in the index analyzer, and its presence in the query analyzer
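The pf2 and stopword behaviour above can be mimicked in a few lines of Python. This is only a sketch of the query shape shown on the slide, not the parser's actual implementation; the `pf2_phrases` helper and its whitespace tokenization are assumptions.

```python
def pf2_phrases(field, query, stopwords=frozenset()):
    """Return (main_clause, bigram_phrases) in the shape shown on the slide."""
    terms = query.split()  # naive whitespace tokenization (assumption)
    # Stopwords are dropped from the mandatory main clause...
    main_terms = [t for t in terms if t.lower() not in stopwords]
    main_clause = f"+{field}:({' '.join(main_terms)})"
    # ...but kept when pairing adjacent terms for the optional proximity boost.
    bigrams = [f'{field}:"{a} {b}"' for a, b in zip(terms, terms[1:])]
    return main_clause, bigrams

if __name__ == "__main__":
    main, boosts = pf2_phrases("myfield", "solr is awesome", {"is"})
    print(main, "(" + " ".join(boosts) + ")")
```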

  7. Spatial Search

  8. Spatial Search
     Step 1: Index some locations!
       <field name="name">The Alpine Shop</field>
       <field name="store">44.013617,-73.168264</field>
     Step 2: Decide where you are
       &pt=44.0153371,-73.16734
       &d=1
       &sfield=store
     Step 3: Profit!
       Spatial Filter: &fq={!geofilt}
       Bounding Box: &fq={!bbox}
       Distance Function: &sort=geodist() asc
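To make the three steps concrete, the Python sketch below builds the filter parameters from the slide and estimates the distance between the example point and the example store with the haversine formula. The helper names, the `q=*:*` main query, and the Earth radius constant are assumptions of this sketch; treating `d` as kilometres and haversine as the distance measure reflects my understanding of Solr's defaults rather than anything stated on the slide.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption of this sketch)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def geofilt_params(pt, d, sfield):
    """Request parameters for a spatial filter sorted by distance."""
    lat, lon = pt
    return {
        "q": "*:*",               # match-all main query (assumption)
        "fq": "{!geofilt}",       # swap in "{!bbox}" for the bounding-box filter
        "pt": f"{lat},{lon}",     # where you are
        "d": d,                   # search radius
        "sfield": sfield,         # the indexed location field
        "sort": "geodist() asc",  # nearest first
    }

if __name__ == "__main__":
    here = (44.0153371, -73.16734)
    store = (44.013617, -73.168264)
    print(geofilt_params(here, 1, "store"))
    print(haversine_km(*here, *store) <= 1)  # is the store inside the radius?
```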

  9. Result Grouping /Field Collapsing

  10. Field Collapsing Definition
     • Field collapsing
       • Limit the number of results per category
       • "category" is normally defined by the unique values in a field
     • Uses
       • Web Search: collapse by web site
       • Email threads: collapse by thread id
       • Ecommerce/retail: show the top 5 items for each store category (music, movies, etc)
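The definition above can be expressed as a short client-side Python sketch: walk a ranked result list and keep at most a fixed number of documents per field value. This only illustrates the concept; the `collapse` function is hypothetical and says nothing about how Solr implements grouping internally.

```python
def collapse(ranked_docs, field, limit):
    """Keep at most `limit` docs per distinct value of `field`.

    ranked_docs must already be sorted best-first; groups come back in the
    order of their best-ranked document.
    """
    groups = {}
    for doc in ranked_docs:
        bucket = groups.setdefault(doc.get(field), [])
        if len(bucket) < limit:
            bucket.append(doc)
    return groups

if __name__ == "__main__":
    results = [
        {"id": 1, "site": "a.com"},
        {"id": 2, "site": "a.com"},
        {"id": 3, "site": "b.com"},
        {"id": 4, "site": "a.com"},
    ]
    print(collapse(results, "site", 2))
```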

  11. Field Collapsing by Site

  12. Result Grouping by Category Field Collapse on Product Type

  13. Group by Field
     Request:
       http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact
     Response:
       "grouped":{
         "manu_exact":{
           "matches":3,
           "groups":[{
               "groupValue":"Belkin",
               "doclist":{"numFound":2,"start":0,"docs":[
                   { "id":"IW-02", "name":"iPod & iPod Mini USB 2.0 Cable"}]
               }},
             {
               "groupValue":"Apple Computer Inc.",
               "doclist":{"numFound":1,"start":0,"docs":[
                   { "id":"MA147LL/A", "name":"Apple 60 GB iPod with Video Playback Black"}]
               }}]}}}
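A client might walk the grouped response above as in the following Python sketch. The sample data is copied from the slide and wrapped in an outer object so that it parses as JSON; the `summarize_groups` helper is hypothetical.

```python
import json

# Sample response from the slide, wrapped in an outer object for parsing.
SAMPLE = json.loads("""
{"grouped": {"manu_exact": {"matches": 3, "groups": [
  {"groupValue": "Belkin",
   "doclist": {"numFound": 2, "start": 0, "docs": [
     {"id": "IW-02", "name": "iPod & iPod Mini USB 2.0 Cable"}]}},
  {"groupValue": "Apple Computer Inc.",
   "doclist": {"numFound": 1, "start": 0, "docs": [
     {"id": "MA147LL/A", "name": "Apple 60 GB iPod with Video Playback Black"}]}}
]}}}
""")

def summarize_groups(response, field):
    """Return (group value, total docs in group, ids of returned docs) tuples."""
    groups = response["grouped"][field]["groups"]
    return [
        (g["groupValue"], g["doclist"]["numFound"], [d["id"] for d in g["doclist"]["docs"]])
        for g in groups
    ]

if __name__ == "__main__":
    for value, total, ids in summarize_groups(SAMPLE, "manu_exact"):
        print(value, total, ids)
```

Note that `numFound` counts every document in the group, while `docs` holds only the ones returned for it, which is why the Belkin group reports 2 but lists a single document.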

  14. Group by Query
     Request:
       http://...&group=true&group.query=price:[0 TO 99.99]&group.query=price:[100 TO *]&group.limit=5
     Response:
       "grouped":{
         "price:[0 TO 99.99]":{
           "matches":3,
           "doclist":{"numFound":2,"start":0,"docs":[
               { "id":"IW-02", "name":"iPod & iPod Mini USB 2.0 Cable"},
               { "id":"F8V7067-APL-KIT", "name":"Belkin Mobile Power Cord for iPod"}]
           }},
         "price:[100 TO *]":{
           "matches":3,
           "doclist":{"numFound":1,"start":0,"docs":[
               { "id":"MA147LL/A", "name":"Apple 60 GB iPod with Video Playback Black"}]
           }}}}

  15. Grouping Params

  16. Faceting

  17. Pivot Faceting
     • Other names that could have made sense:
       • Grid Faceting, Cross-Product Faceting, Matrix Faceting
     • Syntax: facet.pivot=field1,field2,field3,…
       Example: facet.pivot=cat,inStock

  18. Pivot Faceting (continued)
     Request:
       http://...&facet=true&facet.pivot=cat,popularity
     Response:
       "facet_counts":{
         "facet_pivot":{
           "cat,popularity":[{
               "field":"cat",
               "value":"electronics",
               "count":14,
               "pivot":[{
                   "field":"popularity", "value":"6", "count":5},
                 { "field":"popularity", "value":"7", "count":4},
                 […]
                 { "field":"popularity", "value":"1", "count":2}]},
             { "field":"cat",
               "value":"memory",
               "count":3,
               "pivot":[]},
             […]
     Reading the counts: 14 docs w/ cat==electronics; 5 docs w/ cat==electronics && popularity==6
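The nested counts in the response above can be reproduced over an in-memory list of documents, as in this Python sketch. It assumes single-valued fields and is only meant to show what the numbers mean; `pivot_counts` is a hypothetical helper, not Solr code.

```python
from collections import Counter

def pivot_counts(docs, fields):
    """Nested facet counts for `fields`, shaped like the facet_pivot response."""
    if not fields:
        return []
    head, rest = fields[0], fields[1:]
    counts = Counter(d[head] for d in docs if head in d)
    result = []
    for value, count in counts.most_common():
        # Recurse on the subset of docs that share this value.
        subset = [d for d in docs if d.get(head) == value]
        result.append({
            "field": head,
            "value": value,
            "count": count,
            "pivot": pivot_counts(subset, rest),
        })
    return result

if __name__ == "__main__":
    docs = [
        {"cat": "electronics", "popularity": 6},
        {"cat": "electronics", "popularity": 6},
        {"cat": "electronics", "popularity": 7},
        {"cat": "memory", "popularity": 5},
    ]
    print(pivot_counts(docs, ["cat", "popularity"]))
```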

  19. Range Faceting
     • Like Date faceting, but more generic
     Request:
       http://...&facet=true
         &facet.range=price
         &facet.range.start=0
         &facet.range.end=500
         &facet.range.gap=50
     Response:
       "facet_counts":{
         "facet_ranges":{
           "price":{
             "counts":{
               "0.0":5,
               "50.0":2,
               "100.0":0,
               "150.0":2,
               "200.0":0,
               "250.0":1,
               "300.0":2,
               "350.0":2,
               "400.0":0,
               "450.0":1},
             "gap":50.0,
             "start":0.0,
             "end":500.0}}}}
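The bucketing behind these counts can be sketched in Python as follows. The sketch assumes each bucket includes its lower bound and excludes its upper bound, and that values outside [start, end) are ignored; `range_facet` is a hypothetical helper.

```python
import math

def range_facet(values, start, end, gap):
    """Count values into [start + i*gap, start + (i+1)*gap) buckets."""
    n_buckets = int(math.ceil((end - start) / gap))
    # Keys mirror the response format: the lower bound of each bucket.
    counts = {str(float(start + i * gap)): 0 for i in range(n_buckets)}
    for v in values:
        if start <= v < end:
            i = int((v - start) // gap)
            counts[str(float(start + i * gap))] += 1
    return counts

if __name__ == "__main__":
    prices = [11.5, 19.95, 399.0]
    print(range_facet(prices, start=0, end=500, gap=50))
```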

  20. Existing single-valued faceting algorithm
     Request: q=Juggernaut&facet=true&facet.field=hero
     [Figure: the documents matching the base query "Juggernaut" are looked up in the Lucene FieldCache entry (StringIndex) for the "hero" field. order: for each doc, an index into the lookup array. lookup: the string values ((null), batman, flash, spiderman, superman, wolverine). Each matching doc increments the accumulator slot for its term; a priority queue then holds the top terms, e.g. flash 5, Batman 3.]
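The algorithm in the figure can be written out as a short Python sketch: one `order` array mapping each doc to a term ordinal, one `lookup` array of term strings, an accumulator of counts, and a priority queue for the top terms. The array contents below are made up for illustration (only the term list comes from the slide), and `facet_single_valued` is a hypothetical helper.

```python
import heapq

def facet_single_valued(order, lookup, base_docset, limit):
    """Top `limit` (term, count) pairs for the docs in base_docset."""
    accumulator = [0] * len(lookup)
    for doc_id in base_docset:
        accumulator[order[doc_id]] += 1  # one array increment per matching doc
    # Skip the (null) slot and empty counts, then take the largest counts.
    candidates = (
        (count, term)
        for count, term in zip(accumulator, lookup)
        if term is not None and count > 0
    )
    return [(term, count) for count, term in heapq.nlargest(limit, candidates)]

if __name__ == "__main__":
    lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
    order = [2, 1, 2, 0, 2, 1, 4, 2, 1, 2]  # illustrative ordinals, one per doc
    print(facet_single_valued(order, lookup, range(len(order)), limit=2))
```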

  21. Per-segment single-valued algorithm
     [Figure: the base DocSet is spread across four segments. Each segment has its own FieldCache entry and its own accumulator (accumulator1 to accumulator4), filled by its own thread (thread1 to thread4) using the same lookup/increment step as before. A FieldCache + accumulator merger (priority queue) combines the per-segment counts into the final priority queue, e.g. flash 5, Batman 3.]

  22. Per-segment faceting
     • Enable with facet.method=fcs
     • Controllable multi-threading: facet.field={!threads=4}myfield
     • Disadvantages
       • Larger memory use (FieldCaches + accumulators)
       • Slower (extra FieldCache merge step needed)
     • Advantages
       • Rebuilds FieldCache entries only for new segments (NRT friendly)
       • Multi-threaded
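The trade-off listed above shows up clearly in a small Python sketch: each segment is counted independently (and can run on its own thread), but because term ordinals differ between segments, the partial counts have to be merged by term string afterwards. This is a conceptual sketch under those assumptions; the function names and data layout are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_segment(segment, base_docs):
    """Count terms for one segment; segment is an (order, lookup) pair."""
    order, lookup = segment
    accumulator = [0] * len(lookup)
    for doc_id in base_docs:
        accumulator[order[doc_id]] += 1
    # Translate ordinals to strings so results are comparable across segments.
    return Counter({
        lookup[i]: c for i, c in enumerate(accumulator)
        if c and lookup[i] is not None
    })

def facet_per_segment(segments, base_docs_per_segment, limit, threads=4):
    """Count each segment on a worker thread, then merge by term."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(count_segment, segments, base_docs_per_segment))
    merged = Counter()
    for partial in partials:  # the extra merge step the slide mentions
        merged.update(partial)
    return merged.most_common(limit)

if __name__ == "__main__":
    seg1 = ([1, 2, 1], [None, "batman", "flash"])
    seg2 = ([1, 1, 0, 2], [None, "flash", "superman"])
    print(facet_per_segment([seg1, seg2], [[0, 1, 2], [0, 1, 3]], limit=2))
```

When a new segment appears, only its `(order, lookup)` pair has to be built, which is the NRT-friendly property named on the slide.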

  23. Per-segment faceting performance comparison
     Test index: 10M documents, 18 segments, single-valued field
     • Test A: Base DocSet=100 docs, facet.field on a field with 100,000 unique terms
     • Test B: Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms
     * complete request time, measured externally
     [Results table not included in the transcript]

  24. Faceting Performance Improvements
     • For facet.method=enum, faster initial population of the filterCache (i.e. the first facet request): improvements range from 30% to 32x
     • Optimized facet.method=fc for multi-valued fields and large facet.limit: up to 3x faster
     • Optimized deep facet paging: up to 10x faster with really large facet.offset values
     • Less memory consumed by field cache entries

  25. Scalability

  26. SolrCloud
     • First steps toward simplifying cluster management
     • Integrates Zookeeper
       • Central configuration (schema.xml, solrconfig.xml, etc)
       • Tracks live nodes + shards of collections
     • Removes need for external load balancers
       shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
     • Can specify logical shard ids
       shards=NY_shard,NJ_shard
     • Clients don't need to know shards at all:
       http://localhost:8983/solr/collection1/select?distrib=true
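The `shards` value in the load-balancing example follows a simple pattern: as I understand the syntax, commas separate shards and `|` separates interchangeable replicas of the same shard. A Python sketch of building that value is shown below; the `shards_param` helper is hypothetical.

```python
def shards_param(shards):
    """Build a shards value from a list of shards, each a list of replica URLs.

    Replicas of one shard are joined with "|", shards are joined with ",".
    """
    return ",".join("|".join(replicas) for replicas in shards)

if __name__ == "__main__":
    cluster = [
        ["localhost:8983/solr", "localhost:8900/solr"],  # shard 1 replicas
        ["localhost:7574/solr", "localhost:7500/solr"],  # shard 2 replicas
    ]
    print("shards=" + shards_param(cluster))
```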

  27. SolrCloud: The Future
     • Eliminate all single points of failure
     • Remove Master/Searcher distinction
       • Enables near real-time search in a highly scalable environment
     • High Availability for Writes
       • Eventual consistency model (like Amazon Dynamo, Cassandra)
     • Elastic
       • Simply add/subtract servers, cluster will rebalance automatically
       • By default, Solr will handle document partitioning

  28. Odds & Ends

  29. Auto-Suggest
     • Many people currently use the terms component
       • Can be slow for a large corpus
     • New auto-suggest builds off the SpellCheck component
       • Compact memory-based trie for really fast completions
       • Based on a field in the main index, or on a dictionary file
     Request:
       http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult
     Response:
       "spellcheck":{
         "suggestions":[
           "ult",{
             "numFound":1,
             "startOffset":0,
             "endOffset":3,
             "suggestion":["ultrasharp"]},
           "collation","ultrasharp"]}}
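To show why a trie makes prefix completion fast, here is a minimal Python sketch of the idea. It is a plain dictionary-based trie written for illustration, not the compact structure Solr uses; the `SuggestTrie` class and its weights are assumptions.

```python
END = object()  # sentinel key marking the end of a stored term

class SuggestTrie:
    """Tiny in-memory trie supporting weighted prefix completion."""

    def __init__(self):
        self.root = {}

    def add(self, term, weight=1):
        node = self.root
        for ch in term:
            node = node.setdefault(ch, {})
        node[END] = weight

    def complete(self, prefix, limit=5):
        # Walk down to the node for the prefix; cost depends on prefix length only.
        node = self.root
        for ch in prefix:
            node = node.get(ch)
            if node is None:
                return []
        # Collect every term stored below that node.
        found, stack = [], [(prefix, node)]
        while stack:
            text, current = stack.pop()
            for key, child in current.items():
                if key is END:
                    found.append((child, text))
                else:
                    stack.append((text + key, child))
        found.sort(key=lambda item: (-item[0], item[1]))  # heaviest first, then A-Z
        return [text for _, text in found[:limit]]

if __name__ == "__main__":
    trie = SuggestTrie()
    for term in ["ultrasharp", "ipod", "ultra"]:
        trie.add(term)
    print(trie.complete("ult"))
```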

  30. Index with JSON
     $ URL=http://localhost:8983/solr/update/json
     $ curl $URL -H 'Content-type:application/json' -d '
     {
       "add": {
         "doc": {
           "id" : "978-0641723445",
           "cat" : ["book","hardcover"],
           "title" : "The Lightning Thief",
           "author" : "Rick Riordan",
           "series_t" : "Percy Jackson and the Olympians",
           "sequence_i" : 1,
           "genre_s" : "fantasy",
           "inStock" : true,
           "price" : 12.50,
           "pages_i" : 384
         }
       }
     }'
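The same update can be prepared from Python with only the standard library, as sketched below. The code builds the request object but does not send it; sending it would need `urllib.request.urlopen` and a running Solr at the URL from the slide. The `build_add_request` helper is hypothetical.

```python
import json
from urllib.request import Request

UPDATE_URL = "http://localhost:8983/solr/update/json"  # URL from the slide

def build_add_request(doc, url=UPDATE_URL):
    """Wrap one document in the {"add": {"doc": ...}} envelope as a POST request."""
    body = json.dumps({"add": {"doc": doc}}).encode("utf-8")
    return Request(url, data=body, headers={"Content-type": "application/json"})

if __name__ == "__main__":
    book = {
        "id": "978-0641723445",
        "cat": ["book", "hardcover"],
        "title": "The Lightning Thief",
        "author": "Rick Riordan",
        "inStock": True,
        "price": 12.50,
    }
    request = build_add_request(book)
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```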

  31. Query Results in CSV
     Request:
       http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv
     Response:
       name,price,cat,popularity
       iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
       Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
       Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
     • Can handle multi-valued fields (see the "cat" field in the example)
     • Completely compatible with the CSV update handler (can round-trip)
     • Results are streamed, which is good for dumping entire parts of the index
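Reading this output on the client side is straightforward with Python's `csv` module, as the sketch below shows using the sample rows from the slide. Splitting the multi-valued "cat" field on commas is a simplification that ignores any escaped separators inside values; `parse_solr_csv` is a hypothetical helper.

```python
import csv
import io

# Sample output copied from the slide.
SAMPLE_CSV = '''name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
'''

def parse_solr_csv(text, multi_valued=("cat",)):
    """Parse CSV results into dicts, splitting the named multi-valued fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for field in multi_valued:
            if row.get(field):
                row[field] = row[field].split(",")  # naive split (assumption)
        rows.append(row)
    return rows

if __name__ == "__main__":
    for doc in parse_solr_csv(SAMPLE_CSV):
        print(doc["name"], doc["cat"])
```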

  32. http://localhost:8983/solr/browse

  33. Q&A
