
Small-Corpus-Based Automatic Chinese Unknown Word Extraction

This presentation from N.Y.U.S.T. reviews a paper by Tao-Hsing Chang and Chia-Hoang Lee on extracting unknown Chinese words efficiently and accurately from small documents, addressing the drawbacks of statistics-based and rule-based approaches. The research focuses on identifying compound words and proper names.


Presentation Transcript


  1. 國立雲林科技大學 (National Yunlin University of Science and Technology) • Automatic Chinese unknown word extraction using small-corpus-based method • Advisor: Dr. Hsu • Graduate: Chien-Shing Chen • Authors: Tao-Hsing Chang, Chia-Hoang Lee • Source: Proceedings of the 2003 International Conference on Natural Language Processing and Knowledge Engineering, IEEE

  2. Outline • N.Y.U.S.T. • I.M. • Motivation • Objective • Introduction • Extracting possible unknown words • SPLR • Modification • Prefixed/suffixed, Compound word selection • Experiment • Conclusion • Opinion

  3. Motivation • Any Chinese character can either stand as a word by itself or be part of other words • There are no blanks between Chinese words to mark their boundaries • Both statistics-based and rule-based methods have drawbacks • “拍打皮卡丘” (“hitting Pikachu”) • “觀光協會” (“tourism association”), “神奇寶貝” (“Pokémon”)

  4. Objective • Extract Chinese unknown words • efficiently • accurately • even when the words occur rarely • even when the training documents are small

  5. 1-1. Introduction • Unknown words are words that do not exist in the dictionary or vocabulary • Identifying boundaries: “拍打皮卡丘” (“hitting Pikachu”), “資料探勘非常有意思” (“data mining is very interesting”) • Semantic ambiguity: “觀光協會” (“tourism association”), “神奇寶貝” (“Pokémon”)

  6. 1-2. Introduction • Scope is often restricted to particular types of unknown words • Prefixes/suffixes are used to identify proper names • A hybrid method is used to estimate the probability • General unknown words are difficult to identify • “熱鬧非凡” (“extraordinarily lively”), “回味無窮” (“leaves a lasting aftertaste”), “神奇寶貝” (“Pokémon”) • “發生什麼” (“what happened”), “老師問問題” (“the teacher asks a question”)

  7. 1-3. Introduction • Statistics-based methods • Small documents cause low accuracy • Develop a method that • keeps the efficiency of statistics-based methods • identifies words accurately even when the documents are small

  8. 2. Previous Works • A proper name that is a compound word cannot be identified • “中國國際商業銀行” (International Commercial Bank of China) • “中國” (China), “國際” (international), “商業” (commercial), “銀行” (bank) • Statistics-based methods • occurrence frequency • PLU-based likelihood ratio (PLR) • both efficient and fast • words that occur rarely cannot be extracted

  9. 3-1. Extracting Possible Unknown Words • Preprocessing • Retrieve possible character sequences • The maximum length of a character sequence is limited • Eliminate stop words from the character sequences • Frequently occurring character sequences are then regarded as possible unknown words.
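
The preprocessing steps on this slide can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `candidate_sequences`, the maximum length of 4, the frequency cutoff of 2, and the tiny stop list are all hypothetical, and treating stop words as split points is one possible reading of "eliminate stop words", not necessarily the authors' procedure.

```python
from collections import Counter

def candidate_sequences(text, max_len=4, min_count=2,
                        stop_chars=frozenset("的了是在，。")):
    """Count character n-grams (length 2..max_len) that contain no stop character."""
    # Split the text at stop characters and whitespace so no n-gram spans them.
    segments, current = [], []
    for ch in text:
        if ch in stop_chars or ch.isspace():
            if current:
                segments.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        segments.append("".join(current))

    # Count every n-gram of bounded length inside each segment.
    counts = Counter()
    for seg in segments:
        for n in range(2, max_len + 1):
            for i in range(len(seg) - n + 1):
                counts[seg[i:i + n]] += 1

    # Keep only sequences that occur often enough (hypothetical cutoff).
    return {seq: c for seq, c in counts.items() if c >= min_count}

if __name__ == "__main__":
    print(candidate_sequences("我去福利社，他去福利社。"))
```

The returned dictionary still contains fragments such as “去福利”; filtering those out is the job of the later steps in the deck.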

  10. 3-2. Extracting Possible Unknown Words • If a sequence's occurrences merely follow those of one of its subsequences, the sequence should not be an unknown word • “去福利社” (“go to the co-op store”) occurs along with “福利社” (“co-op store”), so “去福利社” is not a possible unknown word

  11. 3-3. Extracting Possible Unknown Words • Defined:

  12. 3-4. Extracting Possible Unknown Words • “去福利社” (“go to the co-op store”): 200 times • “福利社” (“co-op store”): 1000 times • SPLR(tp) = [formula not preserved] • [symbol not preserved] = tolerated-error coefficients
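
The SPLR formula itself is not preserved in this transcript, so the sketch below does not reproduce it. It only illustrates, with the counts given on this slide, the kind of comparison the example suggests: what share of the occurrences of the shorter sequence fall inside the longer one. The function name `containment_ratio` is hypothetical.

```python
def containment_ratio(seq_count, subseq_count):
    """Fraction of the subsequence's occurrences that fall inside the longer sequence."""
    if subseq_count <= 0:
        raise ValueError("subseq_count must be positive")
    return seq_count / subseq_count

# Counts from the slide: "去福利社" 200 times, "福利社" 1000 times.
ratio = containment_ratio(200, 1000)
print(ratio)  # 0.2
```

A low value means “福利社” mostly appears without the leading “去”, which is consistent with the deck's conclusion that “去福利社” is not a possible unknown word.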

  13. 4. Modification • 1. One-character prefix (前綴) or suffix (字尾): “導師室” (“homeroom teachers' office”) contains “導師” (“homeroom teacher”), which results in a low SPLR for “導師室” • 2. Familiar sequences: “從教室裡衝出來” (“rush out of the classroom”) is not an unknown word but would be identified as one by the simple SPLR method

  14. 4-1-1. Prefixed/Suffixed Word Revising • Some words that contain prefixes or suffixes have already been collected in available dictionaries • For example, for an unknown word: • “總領隊” (“chief team leader”) includes a prefix: “ocw + mcw” • “導師室” (“homeroom teachers' office”) includes a suffix: “mcw + ocw”

  15. 4-1-2. Prefixed/Suffixed Word Revising • One-character prefixes/suffixes can be extracted in advance from available dictionaries.
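
One way to extract one-character prefixes and suffixes from a dictionary in advance is sketched below. It assumes that "ocw" denotes the one-character part and "mcw" the multi-character part of a word, and that an affix is worth keeping once it attaches to at least `min_count` dictionary words. The function name, the cutoff, and the toy lexicon are hypothetical and not taken from the paper.

```python
from collections import Counter

def one_char_affixes(lexicon, min_count=2):
    """Collect one-character prefixes/suffixes that attach to multi-character lexicon words."""
    words = set(lexicon)
    prefixes, suffixes = Counter(), Counter()
    for w in words:
        if len(w) < 3:
            continue  # need one character plus a multi-character word
        if w[1:] in words:   # ocw + mcw, e.g. 總 + 領隊
            prefixes[w[0]] += 1
        if w[:-1] in words:  # mcw + ocw, e.g. 導師 + 室
            suffixes[w[-1]] += 1

    def frequent(counter):
        return {ch for ch, n in counter.items() if n >= min_count}

    return frequent(prefixes), frequent(suffixes)

if __name__ == "__main__":
    # Toy lexicon for illustration only.
    toy = ["導師", "導師室", "辦公", "辦公室", "領隊", "總領隊", "經理", "總經理"]
    print(one_char_affixes(toy))
```

With affix sets like these in hand, a candidate such as “導師室” can be recognized as "known word plus suffix" even when its score would otherwise be low.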

  16. N.Y.U.S.T. • I.M.

  17. 4-2-1. Compound Word Selection • A familiar sequence in the document • includes one or more common words, whereas a compound word consists of particular words • “從教室裡衝出來” (“rush out of the classroom”) consists of the common words “教室” (“classroom”) and “出來” (“come out”) • “文具用品” (“stationery supplies”): 100 times • “文具” (“stationery”): 100 times • “用品” (“supplies”): 100 times

  18. 4-2-2. Compound Word Selection • ts is a word that is contained in tp and is not a one-character word • [symbol not preserved] is the threshold • A sequence that consists of common words should not be a possible unknown word

  19. 4-2-3. Compound Word Selection • Familiar sequences and compound words can be differentiated efficiently • “神奇寶貝” (“Pokémon”): 200 times • “神奇” (“magical”): 230 times • “寶貝” (“treasure”): 250 times • ratio 200/230 • “發生什麼” (“what happened”): 200 times • “發生” (“happen”): 2000 times • “什麼” (“what”): 4000 times • ratio 200/2000
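
The two ratios on this slide suggest a check along the following lines. The selection formula is not preserved in the transcript, so this is a sketch under assumptions: the candidate tp is kept as a compound word only if its count divided by the count of every multi-character component ts reaches a threshold. The function name `is_compound`, the threshold of 0.5, and the "every component" rule are hypothetical.

```python
def is_compound(tp_count, component_counts, threshold=0.5):
    """Return True if tp occurs often enough relative to each multi-character component."""
    ratios = [tp_count / c for c in component_counts.values() if c > 0]
    # Require at least one component, and every ratio to reach the threshold.
    return bool(ratios) and min(ratios) >= threshold

# Counts from the slide.
print(is_compound(200, {"神奇": 230, "寶貝": 250}))    # True: 200/230 and 200/250 are both high
print(is_compound(200, {"發生": 2000, "什麼": 4000}))  # False: 200/2000 and 200/4000 are both low
```

The components of a familiar sequence are common words that occur far more often than the sequence itself, so its ratios are small; the components of a compound word occur mostly inside the compound, so its ratios are close to 1.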

  20. 5. Experiments • Data set: 1,285 student essays • Theme: “Recess at School” • Characters: 470,665

  21. 5-1. Experiments - SPLR

  22. 5-2. Experiments - Familiar

  23. 5-3. Experiments - Prefixed/suffixed • Prefixed or suffixed patterns in the CKIP lexicon (Chinese Knowledge and Information Processing group, Institute of Information Science, Academia Sinica)

  24. 6. Conclusion • efficiency • accuracy • handles words that occur rarely • works with a small training corpus

  25. Opinion • Information retrieval • Unknown word • Compound word • Semantic web
