
Destination Japan: Internationalization of the Lycos Search Engine


Presentation Transcript


  1. Destination Japan: Internationalization of the Lycos Search Engine. Presented by: Jeff Vander Clute of Lycos, Inc. & Tina Lieu of Basis Technology Corp.

  2. Lycos is... • A new-generation Web company: 4 top-20 Web properties in the network (Lycos, Tripod, Angelfire, HotBot) • A “hub” • Search Engine & Navigation: patented search & directory technology • Community & Communication • E-commerce, Content Aggregation, etc.

  3. The Search Technology Created by CMU professor (Fuzzy Mauldin) & students in 1994/95. 1. “Intelligent” spidering methods (now patented), but not internationalized. Spiders crawl the web retrieving documents for indexing. 2. Back-end database of webpages, or catalog, plus relevancy algorithms for ordering search results.

  4. First Stop: Europe • Lycos search technology initially for ASCII only. In-house work to make data paths 8-bit clean, to accommodate European languages. • Otherwise relatively straightforward. Components such as ad servers, Web servers, etc., require little if any changes. • Euro service came online in May 1997.
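
To make "8-bit clean" concrete, here is a minimal sketch of what goes wrong on a data path that keeps only 7 bits per byte. The function `seven_bit_path` is a hypothetical stand-in for legacy ASCII-only code, not anything from the Lycos code base.

```python
def seven_bit_path(data: bytes) -> bytes:
    """Hypothetical legacy data path that clears the high bit of every byte."""
    return bytes(byte & 0x7F for byte in data)


def eight_bit_clean_path(data: bytes) -> bytes:
    """An 8-bit clean path passes every byte through unchanged."""
    return bytes(data)


raw = "café".encode("latin-1")  # 'é' is 0xE9 in ISO-8859-1

print(eight_bit_clean_path(raw).decode("latin-1"))  # the accent survives
print(seven_bit_path(raw).decode("latin-1"))        # 0xE9 & 0x7F == 0x69, so 'é' becomes 'i'
```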

  5. What’s Unicode? Where’s Japan? • The more interesting problem. • Business reasons to introduce Japanese search. • But not a lot of international(ization) experience within Lycos at the time. • We needed assistance and chose Basis Technology.

  6. Goals • Quick deployment of Japanese search • 1995 to 1997: Japanese Internet more than doubling each year • Marketing need to launch in Japan ASAP • Economical and efficient solution • Produce reusable internationalized code • Poise Lycos for even quicker deployment into other languages • Get "more bang for the buck"

  7. Two Main Functions of a Search Engine • Building a catalog: compiling an indexed catalog of webpages from the Internet • Performing a query: delivering a list of webpages matching certain keywords and parameters input by the user
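
The two functions can be sketched as a toy inverted index. This is only an illustration under simple assumptions (whitespace-delimited words, no relevancy ranking); the names `build_catalog` and `query` and the example.com URLs are hypothetical and not taken from Lycos.

```python
from collections import defaultdict


def build_catalog(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of page URLs that contain it."""
    catalog: dict[str, set[str]] = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            catalog[word].add(url)
    return catalog


def query(catalog: dict[str, set[str]], keywords: str) -> set[str]:
    """Return the pages that contain every keyword (unranked)."""
    hits = [catalog.get(word, set()) for word in keywords.lower().split()]
    return set.intersection(*hits) if hits else set()


pages = {
    "http://example.com/a": "Lycos search engine",
    "http://example.com/b": "Japanese search engine",
}
catalog = build_catalog(pages)
print(sorted(query(catalog, "search engine")))
print(sorted(query(catalog, "japanese search")))
```

Splitting on whitespace is exactly the step that fails for Japanese text, which is why the later slides introduce a word breaker.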

  8. Japanese Issues for Catalog • Double-Byte: Japanese characters are double-byte. • Multiple encodings: Japanese webpages use 3 encodings: Shift-JIS, EUC-JP, and ISO-2022-JP. • Options: Multiple vs. Single Catalog • Three catalogs: one in Shift-JIS, one in EUC-JP, one in ISO-2022-JP (an awkward and complicated solution to implement) OR • One catalog: all catalog data either in one Japanese encoding or in Unicode

  9. Single Catalog Options • A) Convert all data to one Japanese encoding • ISO-2022-JP, Shift-JIS, or EUC-JP • B) Convert all data to Unicode • The quick and economical choice. Unicode is . . . • A superset of all scripts and character set encodings used on the Web, therefore reusable for other languages • More easily integrated into existing code originally written for processing single-byte ASCII

  10. The Unicode Plan • Use Unicode in catalog & internal processing • Because all electronic text on the Web maps cleanly into Unicode • Required elements: • Character encoding conversions: Unicode ↔ webpage encodings (Shift-JIS, ISO-2022-JP, and EUC-JP) • Encoding auto-detection • Japanese word breaking

  11. Encoding Conversion • Purpose: Convert data between encodings used on the Web and Unicode (which is still not used universally on the Web) • From 寿司 in Shift-JIS you want 寿司 in Unicode • Functionality provided by Basis Technology's Rosette embedded in Lycos code as source • Rosette is a cross-platform C++ library for Unicode; http://www.basistech.com/products/ • Complete set of mapping tables between Unicode and major legacy encodings • Conversions performed quickly and economically with minimal impact on performance
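
As a rough illustration of the conversion step, the sketch below uses Python's built-in codecs rather than Rosette, whose API is not shown in these slides. The helper name `to_unicode` is hypothetical.

```python
def to_unicode(raw: bytes, encoding: str) -> str:
    """Convert bytes in a legacy Japanese encoding to a Unicode string."""
    return raw.decode(encoding)


sushi = "寿司"

# The same two characters have a different byte sequence in each Web encoding,
# but all three convert back to the same Unicode text.
for encoding in ("shift_jis", "euc_jp", "iso2022_jp"):
    raw = sushi.encode(encoding)
    print(encoding, raw.hex(), "->", to_unicode(raw, encoding))
```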

  12. Why Encoding Auto-Detection? • In order to convert text to another encoding, you have to know where you’re starting from. Or you could get . . . • Ex. Text in EUC-JP when viewed as other encodings. EUC-JP: 寿司 コンピュータ 花見 Shift-JIS: シハ ・ウ・ヤ・蝪シ・ソ イヨクォ ASCII:
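
The effect on this slide can be reproduced with Python's built-in codecs. This is a minimal sketch: `errors="replace"` is used because some EUC-JP byte pairs are not valid Shift-JIS at all, so the exact garbage shown may differ from the slide.

```python
text = "寿司 コンピュータ 花見"
raw = text.encode("euc_jp")

# Decoding with the right encoding recovers the text.
right = raw.decode("euc_jp")

# Decoding the same bytes as Shift-JIS yields garbage; invalid sequences
# are replaced with U+FFFD instead of raising an error.
wrong = raw.decode("shift_jis", errors="replace")

print("as EUC-JP:   ", right)
print("as Shift-JIS:", wrong)
```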

  13. Encoding Auto-Detection • Purpose: to correctly identify encoding of webpage or query in order to convert properly from one encoding to another. • Functionality provided by Basis Technology's Rosette • Auto-detection on Japanese text in Shift-JIS, EUC-JP, or ISO-2022-JP encodings • Enhanced tiebreaker functionality to auto-detect very short strings (queries)
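
One simple way to approach detection is sketched below. It is not Rosette's algorithm, which the slides do not describe. The sketch assumes that ISO-2022-JP can be recognized by its escape sequences, that a strict trial decode rules out most wrong guesses, and that a remaining tie is best broken by preferring the reading with fewer half-width katakana (a common heuristic, chosen here as an assumption). All function names are hypothetical.

```python
from typing import Optional


def _try_decode(raw: bytes, encoding: str) -> Optional[str]:
    """Return the decoded text, or None if raw is not valid in this encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


def _halfwidth_kana_count(text: str) -> int:
    """Count half-width katakana, which are rare in real Japanese webpages."""
    return sum(1 for ch in text if "\uff61" <= ch <= "\uff9f")


def detect_japanese_encoding(raw: bytes) -> str:
    # ISO-2022-JP is 7-bit and switches to kanji with an ESC $ sequence.
    if b"\x1b$" in raw:
        return "iso2022_jp"
    if all(byte < 0x80 for byte in raw):
        return "ascii"
    decoded = {enc: _try_decode(raw, enc) for enc in ("euc_jp", "shift_jis")}
    valid = {enc: text for enc, text in decoded.items() if text is not None}
    if not valid:
        return "unknown"
    # Tiebreaker (assumption): fewest half-width katakana wins.
    return min(valid, key=lambda enc: _halfwidth_kana_count(valid[enc]))


for encoding in ("shift_jis", "euc_jp", "iso2022_jp"):
    sample = "コンピュータ".encode(encoding)
    print(encoding, "->", detect_japanese_encoding(sample))
```

Short queries give a detector very little evidence to work with, which is why the slide calls out a dedicated tiebreaker for them.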

  14. Japanese Word Breaking • Purpose: To return indexable units (words) for creating an index, or for breaking the query into words to look up in the index. • Problem: Japanese words are not delimited by spaces • Solution: Basis Technology's Japanese Morphological Analyzer (http://www.basistech.com/products/) • Dictionary-based Japanese word breaking • Elimination of stop words (e.g., “a”, “the”) • Looks for longest word match
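
Dictionary-based longest-match word breaking can be sketched in a few lines. This is a toy illustration, not the Basis Technology analyzer: the dictionary, the stop-word list (two Japanese particles, chosen as an assumption), and the function name `break_words` are all hypothetical.

```python
DICTIONARY = {"花", "花見", "寿司", "食べる", "で", "を"}
STOP_WORDS = {"で", "を"}  # particles treated as stop words for this example
MAX_WORD_LENGTH = max(len(word) for word in DICTIONARY)


def break_words(text: str) -> list[str]:
    """Greedy longest-match segmentation with stop-word removal."""
    words = []
    position = 0
    while position < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(MAX_WORD_LENGTH, 0, -1):
            candidate = text[position:position + length]
            if candidate in DICTIONARY:
                break
        else:
            # Unknown character: emit it as a one-character token.
            candidate = text[position]
        if candidate not in STOP_WORDS:
            words.append(candidate)
        position += len(candidate)
    return words


print(break_words("花見で寿司を食べる"))
```

Because the longest match is tried first, 花見 is returned as one word rather than 花 followed by an unknown 見.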

  15. Selecting Unicode Representation (1) UCS2 characteristics • Different parts of the Lycos search system use either the UCS2 or the UTF8 representation of Unicode, depending on the task • Characteristics of UCS2 • Each coded character element is fixed width, 16 bits • Data paths must all accommodate 16 bits • Text in UCS2 is easy to manipulate and analyze (from a programming viewpoint)
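
The fixed-width property is what makes UCS2 easy to work with: the n-th character always starts at byte offset 2n. The sketch below uses Python's `utf-16-be` codec as a stand-in, which produces the same bytes as big-endian UCS2 for the characters used here. The helper `char_at` is hypothetical.

```python
text = "Lycos 寿司"
ucs2 = text.encode("utf-16-be")  # same bytes as UCS2 for these characters


def char_at(buffer: bytes, index: int) -> str:
    """Random access: character number `index` occupies bytes 2*index and 2*index + 1."""
    return buffer[2 * index:2 * index + 2].decode("utf-16-be")


print(len(text), "characters,", len(ucs2), "bytes")
print(char_at(ucs2, 6))

# ASCII letters carry a zero high byte, so the buffer contains null bytes
# and cannot pass through code that treats zero as a string terminator.
print(0 in ucs2)
```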

  16. Selecting Unicode Representation (2) UTF8 characteristics • Characteristics of UTF8 • Each coded character is composed of one to six octets (one octet = 8 bits) • Data paths need only be "8-bit clean" • No octet in a multi-byte character is null (i.e., has the value zero) • Text in UTF8 is difficult to manipulate or analyze. • "8-bit clean" = computer code which treats all 8 bits of a byte as significant. True of any computer code that processes European languages properly, but not necessarily true of code that processes only ASCII, which uses only 7 bits per character.
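
These properties are easy to check with Python's built-in `utf-8` codec. The sample characters are chosen for illustration; note that the current UTF-8 definition (RFC 3629) limits sequences to four octets, while the characters shown here need at most three.

```python
# Variable width: ASCII takes one octet, an accented Latin letter two,
# and a kanji three.
for ch in ("A", "é", "寿"):
    encoded = ch.encode("utf-8")
    print(repr(ch), len(encoded), "octet(s):", encoded.hex())

# No zero octets appear in the encoding of non-null text, so UTF8 passes
# through 8-bit clean code that uses zero as a string terminator.
sample = "Lycos 寿司".encode("utf-8")
print(0 in sample)

# ASCII text is byte-for-byte identical in UTF8.
print("Lycos".encode("utf-8") == "Lycos".encode("ascii"))
```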

  17. UCS2, UTF8, ASCII, etc. [Figure: characters entering an 8-bit clean data pipe as UTF8 or as UCS2. Labels: ASCII (7 bits); Latin character with diacritical (8 bits); Japanese character (double-byte, in Shift-JIS, EUC-JP, etc.); “16-bit UCS2 can’t fit :(”]

  18. Unicode in the Lycos System • UCS2: Japanese Morphological Analyzer from Basis Technology • Using UCS2 is the quick and economical way to process huge volumes of Japanese text. • UTF8: Lycos Catalog • Economy of disk space: ASCII is smaller in UTF8. On the Web*: ASCII 79%, double-byte Asian less than 5%, European encodings and others 16% • Ease of integration with existing code (a.k.a. transmissibility) • *Based on the number of Web hosts on the Internet by country (total number of hosts for English-speaking domains as a percentage of the total number of hosts worldwide). Source: Survey by Network Wizards, http://www.nw.com
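
The disk-space argument can be made concrete with a back-of-the-envelope estimate. The sketch below treats the slide's host-based shares as if they were per-character shares, which is an assumption, and charges two octets for every European character and three for every Asian character. That is pessimistic for UTF8, since most letters on European pages are plain ASCII.

```python
# Shares from the slide, used here as per-character shares (an assumption).
# "Less than 5%" is rounded up to 5%.
shares = {"ascii": 0.79, "european_and_other": 0.16, "asian_double_byte": 0.05}

# Octets per character in UTF8 (worst case for the non-ASCII groups).
utf8_octets = {"ascii": 1, "european_and_other": 2, "asian_double_byte": 3}
UCS2_OCTETS = 2  # fixed width

average_utf8 = sum(shares[group] * utf8_octets[group] for group in shares)
average_ucs2 = UCS2_OCTETS * sum(shares.values())

print(f"UTF8: {average_utf8:.2f} octets per character")
print(f"UCS2: {average_ucs2:.2f} octets per character")
```

Under these assumptions the catalog averages roughly 1.26 octets per character in UTF8 against 2 in UCS2.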

  19. Project Complete: Lycos Japan (1) • Quick: Prototype of Japanese search is produced in two months. Lycos Japan: http://www.lycos.co.jp • Beta version of Japanese search debuts July 1998; enters competitive Japanese search engine race in 4th place* • Upon formal launch grabs 2nd place in October 1998* • *According to Search Desk, http://www.searchdesk.com

  20. Project Complete: Lycos Japan (2) • E-conomical: Today, Lycos has spider, catalog, and query software that can easily be set up to build catalogs in different languages by swapping localized pieces in and out: • Settings for target domains • Encoding detection and conversion calls • Language-specific word breaker (if needed)
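
One way to picture the swappable pieces is as a small per-language bundle that the shared indexing code receives as a parameter. This is a hypothetical sketch, not the Lycos design: `LanguagePack`, `index_page`, the domain settings, and the placeholder per-character Japanese breaker are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LanguagePack:
    """The localized pieces; everything else in the pipeline is shared."""
    target_domains: tuple[str, ...]
    decode: Callable[[bytes], str]           # encoding detection and conversion
    break_words: Callable[[str], list[str]]  # language-specific word breaker


def index_page(raw: bytes, pack: LanguagePack) -> list[str]:
    """Shared code path: convert to Unicode, then split into indexable words."""
    return pack.break_words(pack.decode(raw))


english = LanguagePack(
    target_domains=(".com",),
    decode=lambda raw: raw.decode("latin-1"),
    break_words=str.split,  # spaces delimit words, so no special breaker is needed
)

japanese = LanguagePack(
    target_domains=(".jp",),
    decode=lambda raw: raw.decode("euc_jp"),
    # Placeholder breaker: one token per character. A real deployment would
    # plug in a dictionary-based analyzer here.
    break_words=lambda text: [ch for ch in text if not ch.isspace()],
)

print(index_page(b"hello world", english))
print(index_page("寿司".encode("euc_jp"), japanese))
```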

  21. Q&A

  22. Q&A Questions? tina@basistech.com www.basistech.com jvanderclute@lycos.com www.lycos.com

  23. Thank you!
