
The PATENTSCOPE search system: CLIR

The PATENTSCOPE search system: CLIR. February 2013. Sandrine Ammann Marketing & Communications Officer. To the PATENTSCOPE search system webinar CLIR. Agenda. CLIR Definition History Search with CLIR Usefulness Golden rules Technicalities Q & A session. CLIR.



Presentation Transcript


  1. The PATENTSCOPE search system: CLIR. February 2013. Sandrine Ammann, Marketing & Communications Officer

  2. To the PATENTSCOPE search system webinar CLIR

  3. Agenda. CLIR: • Definition • History • Search with CLIR • Usefulness • Golden rules • Technicalities. Q & A session.

  4. CLIR: Cross-Lingual Information Retrieval • Finds synonyms of the query terms in different technical domains • Translates those synonyms, together with the original query, into different languages
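The two steps on this slide (find synonyms, then translate the synonyms plus the original query) can be sketched as follows. This is only an illustrative sketch: the `SYNONYMS` and `TRANSLATIONS` tables, the `expand_query` function, and the assumption that the source query is in English are all hypothetical and do not reflect PATENTSCOPE's actual dictionaries or query syntax.

```python
# Hypothetical toy data; PATENTSCOPE's real dictionaries are learned statistically.
SYNONYMS = {"container": ["receptacle", "vessel"]}
TRANSLATIONS = {
    "container": {"fr": ["conteneur"]},
    "receptacle": {"fr": ["récipient"]},
    "vessel": {"fr": ["récipient"]},
}


def expand_query(term, languages):
    """Return {language code: OR-joined query} for a single English keyword."""
    # Step 1: collect the original term and its synonyms.
    variants = [term] + SYNONYMS.get(term, [])
    queries = {"en": " OR ".join(f'"{v}"' for v in variants)}
    # Step 2: translate every variant into each requested language.
    for lang in languages:
        translated = []
        for variant in variants:
            for candidate in TRANSLATIONS.get(variant, {}).get(lang, []):
                if candidate not in translated:  # drop duplicate translations
                    translated.append(candidate)
        if translated:
            queries[lang] = " OR ".join(f'"{t}"' for t in translated)
    return queries


expanded = expand_query("container", ["fr"])
```

Keeping one query per language makes it easy to inspect or edit each language's variants separately, which is the idea behind the supervised mode described later in the deck.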

  5. CLIR: 12 languages available • Non-Asian: Dutch, English, French, German, Italian, Portuguese, Russian, Spanish, Swedish • Asian: Chinese, Japanese, Korean

  6. History

  7. History • Lower language barriers in patent search • First language tool developed in-house

  8. CLIR: the interface

  9. CLIR: precision vs recall • Precision = the ability to retrieve only results that are relevant to the query. • Aiming for only precisely relevant items (high precision) means missing important items because they don't use quite the same vocabulary. • Recall = the ability to retrieve as many documents as possible that match or are related to a query. • Aiming for all the relevant items (high recall) often means getting a lot of junk.
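The trade-off above can be made concrete with the standard definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The sketch below uses made-up document identifiers (`D1` to `D8`) purely for illustration; they are not taken from the presentation.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually found
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection in which four documents are truly relevant.
relevant = {"D1", "D2", "D3", "D4"}

# A narrow query: everything found is relevant, but half is missed.
narrow = precision_recall({"D1", "D2"}, relevant)

# A broad query: nothing is missed, but half of the hits are junk.
broad = precision_recall(
    {"D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8"}, relevant
)
```

Here the narrow query scores a precision of 1.0 with a recall of 0.5, and the broad query scores the reverse.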

  10. CLIR: precision vs recall

  11. Example: precision

  12. Example: recall

  13. Example: ARM

  14. CHIP

  15. CLIR: supervised mode. 2 modes: automatic and supervised • Automatic: 1 step • Supervised: 4 steps

  16. Cross-Lingual Expansion (CLIR)

  17. Result : the query from “container” to:

  18. Supervised mode: 1 of 4 steps

  19. Supervised mode : 2 of 4 steps

  20. Supervised mode : 3 of 4 steps

  21. Crowdsourcing "is the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people and especially from the online community rather than from traditional employees or suppliers. […] Crowdsourcing is different from an ordinary outsourcing since it is a task or problem that is outsourced to an undefined public rather than a specific body." source: http://en.wikipedia.org/wiki/Crowdsourcing

  22. Supervised mode : 4 of 4 steps

  23. First: select languages

  24. Second: select parameters

  25. Stemming: process that removes common endings from words, using the English Porter algorithm • electric¦al = electric • electric¦ity = electric • electron¦ics = electron
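To illustrate the idea of stripping common endings, here is a deliberately simplified suffix stripper. It is not the Porter algorithm: the `SUFFIXES` list and the `toy_stem` function are hypothetical, chosen only to reproduce the three examples on the slide, and a real Porter implementation applies many more rules and may yield different stems.

```python
# Hypothetical suffix list; the real Porter algorithm uses multi-step rules.
SUFFIXES = ("ity", "ics", "al")


def toy_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word  # no known ending: leave the word unchanged


stems = {w: toy_stem(w) for w in ("electrical", "electricity", "electronics")}
```

Because "electrical" and "electricity" reduce to the same stem, a search with stemming enabled would match documents using either form.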

  26. Third: check variants

  27. Third: check variants

  28. Editing

  29. Checking: IPC

  30. Supervised mode: results

  31. Search examples: clothes for sport • Entering “sports clothing” in the Simple search interface returns 168 results • Entering “sports clothing” in the CLIR interface (in automatic mode) returns 5,449 results • Entering “sports clothing” in the CLIR interface (in supervised mode) returns 1,023 results

  32. Why use CLIR? A) Search full-text collections simultaneously in many foreign languages B) Significantly increase the number of relevant results without significantly increasing the number of irrelevant results • 485 results in English titles or abstracts for “sports clothing” • 575 results obtained with CLIR searching in titles or abstracts in all languages C) Have confidence in your searches: no black box; users have access to the CLIR-generated Boolean queries (albeit complex) and have full control over them D) Have a responsive system even for complex queries

  33. Golden rules: Expansion modes • Very specific keyword with only one meaning: AUTO • For any other query, SUPERVISED is recommended. Variants/synonyms • Select words that you would like to appear in your search results • If you have too much noise in the result list, remove generic variants

  34. Golden rules Parameters • 1. Title and abstract: unconstrained distance • 2. Claims: sentence/paragraph distance • 3. Description: sentence/paragraph distance • Stemming recommended

  35. Technicalities • Compilation of a long list of titles in language pairs • Creation of an in-house extraction methodology • Tool learns statistical bilingual dictionaries from the titles (ZH, FR, DE, EN, ES, KO)
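The presentation does not describe the in-house extraction methodology, so the following is only a generic sketch of how a statistical bilingual dictionary could be learned from paired titles. It assumes a simple co-occurrence approach scored with the Dice coefficient; the `learn_dictionary` function and the three English/French title pairs are hypothetical.

```python
from collections import Counter
from itertools import product


def learn_dictionary(title_pairs):
    """Map each source word to the target word it co-occurs with most strongly."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_title, tgt_title in title_pairs:
        src_words = set(src_title.lower().split())
        tgt_words = set(tgt_title.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Every source word is a candidate match for every target word.
        pair_count.update(product(src_words, tgt_words))

    best = {}
    for (src, tgt), together in pair_count.items():
        # Dice coefficient: 1.0 when the two words always appear together.
        score = 2 * together / (src_count[src] + tgt_count[tgt])
        if src not in best or score > best[src][1]:
            best[src] = (tgt, score)
    return {src: tgt for src, (tgt, _) in best.items()}


# Hypothetical parallel titles (English, French).
pairs = [
    ("plastic container", "conteneur plastique"),
    ("metal container", "conteneur métallique"),
    ("plastic bottle", "bouteille plastique"),
]
dictionary = learn_dictionary(pairs)
```

Even with three title pairs the counts separate "container" from "plastic", which hints at why more titles give better coverage.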

  36. Technicalities • Quality of dictionaries: no human intervention • The more titles available, the better the coverage • Languages: Chinese, Korean, Dutch, English, Portuguese, Italian, French, Russian, Swedish, German, Spanish, Japanese

  37. Technicalities • Disambiguation: process of identifying the sense of a word in a sentence. source: http://en.wikipedia.org/wiki/Disambiguation_%28disambiguation%29 Disambiguation is applied to keywords: • Technical domains based on the IPC • Synonym selection

  38. Future plans • Improve terminology coverage of already supported languages • Add other languages: over 200’000 titles and abstracts with associated high quality translations in English
