1 / 13

Efficient full-text search in databases

Efficient full-text search in databases. Andrew Aksyonoff, Peter Zaitsev Percona Ltd. shodan (at) shodan.ru. Search in databases?. Databases are continually growing “everyone” has got 1M records 10-1 00 M record databases are not that rare

ezra
Download Presentation

Efficient full-text search in databases

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Efficient full-text search in databases Andrew Aksyonoff, Peter Zaitsev Percona Ltd. shodan (at) shodan.ru

  2. Search in databases? • Databases are continually growing • “everyone” has got 1M records • 10-100M record databases are not that rare • 1B+ record databases which require full-text search do exist (most prominent example is Google) • Open-source DBMS are widely used • We will talk about MySQL • “The word on the street” is that other DBMSes have similar problems • Unfortunately, built-in solutions are not good enough for full-text search • And especially so, if there is something beyond “just” full-text search required…

  3. Types of special requirements • “Just”search is a key requirement, but… • Amazing, but it happens rather rarely (in DBMS world) • Rather a Web-search engine task • Additional sorting is frequently required • On a value different from relevance – for instance, on product price • Additional filtering is frequently required • For instance, by product category, or posting author ID • Match grouping is frequently required • For instance, by date, or by data source (eg. site) ID • What do built-in solutions offer?

  4. Built-in MySQL FTS • Pro – built-in, updates “instantly” • Con – scales poorly • Con – ignores word positions • This causes ranking issues • This causes phrase search to be slow • Con – only1 FT index per query (columns…) • Con – does not interoperate with other indexes • I.e.WHERE, ORDER/GROUP BY, LIMITclauses would be handled separately and “manually” • Conclusion – it is often unacceptable

  5. External engines shootout • We tested a number of well-known (to us) open-sourcesolutions • Let the vendors advertise commercial solutions themselves  • MySQL FTS • mnoGoSearch, http://mnogosearch.org/ • Designed for Web, but can do databases too (htdb) • Lucene, http://lucene.apache.org/ • PopularJava full-text searchlibrary • Sphinx, http://sphinxsearch.com/ • Designed for full-text search in databases from day one

  6. Benchmarking results • ~3.5Mrecords, ~5 GBtext (from Wikipedia) • mnoGoSearchdropped out of a race • more details in EuroOscon‘2006 talk by Peter Zaitsev

  7. Existing solutions • mnoGoSearch • Con – indexing and searching time issues • FATAL – did not complete indexing 5 GBin 24hours • Lucene • Pro – “instant” index updates • Pro – wildcard, fuzzysearches • Con – integration cost (this is Java library) • Con – filtering implementation (searching speed) • Con – no support for grouping • Sphinx • Con – “monolithic”indexes • Pro – everything else 

  8. Sphinx – overview • External solution for database search • Two principal programs • Indexer, used for re-indexing FTindexes • Searchd,search daemon • Easy integration • Built-in support for MySQL, PostgreSQL • ProvidesAPIsfor PHP, Python, Perl, Ruby,etc • Provides MySQL Storage Engine • High speed • Indexing speed – 4-10 MB/sec • Searching speed– avg 20-30 ms/q @ 5 GB, 3.5M docs

  9. Sphinx – ideology • Indexes locally available databases • “A-la SQL” document structure supported from day one • Up to 256 full-text fields • Any amount of attributes (integer/timestamp/etc) • “Fast re-indexing instead of slow searching” • Non-updateable index format – was initially chosen to maximize searching speed • But then it turned out – that re-indexing is very fast, too • In case of partial updates – we can still use re-indexing“partial”(delta) indexes once per Nminutes

  10. Sphinx – searching • Quality • Always accounts for word positions, not just frequencies • Scalability • Up to 50-100 GB per 1 CPU • Supports distributed searches • Distributed indexes are fully transparent to client application • Examples • Boardreader.com –500M+ records, 550+GBtext, 12CPU cluster • Mininova.org – not manyrecords (less than 1M), but 2-3Msearches per day

  11. Sphinx –advanced features • Sorting • On any attribute combination, SQL-like syntax • Filtering matches with a condition • Performed at earliest possible searching stage – for speed • Attributes are always either kept in RAM, or copied multiple times all over the index in required order – for speed • Fun fact – sometimes full scan of all matches and filtering those on Sphinx side are times faster than corresponding MySQL SELECT query – and are used in production instead…

  12. Sphinx –advanced features • Grouping • On any attribute • Performed in fixed RAM • Performed approximately (!) • Performed quite efficiently (compared to MySQLetc) • Query words highlighting • Special service, which needs document bodies and the query passed to it • MySQL Storage Engine • Can be used for especially complex queries on MySQL side which can not be run fully on Sphinx side • Can be used to simplify integration

  13. Conclusions • Large and very large databases require external solutions for full-text search • There is a number of requirements to such solutions beyond “just” searching (filtering,grouping,etc) • There is a number of open-sourcesolutions with different degrees of matching these requirements • For most tasks, try Sphinx, http://sphinxsearch.com/

More Related