An Automatic Segmentation Method Combined with Length Descending and String Frequency Statistics for Chinese. Shaohua Jiang, Yanzhong Dang. Institute of Systems Engineering, Dalian University of Technology, China.
1. Introduction • Text is one of the most important tools by which people exchange information and knowledge with each other. Most text processing methods rely on word information, so word segmentation is the foundation of information processing for Chinese texts and determines its effectiveness.
Automatic word segmentation was put forward in the early 1980s. • In recent years, many machine learning and statistical methods have been used to process text automatically on the basis of large-scale electronic text corpora.
This paper proposes an automatic word segmentation method based on length descending and the frequency statistics of Chinese character strings (CCSs). • The texts are collected from applications for scientific projects. • The method does not need to study the collection beforehand to obtain probability information between Chinese characters, so it is a real-time method.
2. Background of automatic segmentation for Chinese text • The existing segmentation methods for Chinese text can be divided into the following categories: • The method based on a dictionary. • The method based on syntax and rules. • The method based on statistics, for example the N-gram method. • The integrated method, which combines the above methods.
The dictionary-based method is the most basic automatic segmentation method for Chinese text and has been adopted by many researchers. • It requires a dictionary constructed by domain experts. • Constructing such a dictionary is time-consuming and often takes experts many years. • Maintaining the dictionary is also difficult because new terms appear continuously. • Moreover, many conflicts inevitably arise from the experts' subjectivity and from the fusion of disciplines.
The method based on syntax and rules performs syntactic and semantic analysis at the same time as it segments words. • It uses syntactic and semantic information to carry out part-of-speech tagging and to resolve segmentation ambiguity. • Existing syntactic knowledge and rules are so general and complex that conflicts between them become unavoidable as their number grows.
To overcome the disadvantages of the dictionary-based method and the method based on syntax and rules, the N-gram model, a statistical language model, was proposed. • The N-gram model assumes that the occurrence probability of a word depends only on the N-1 words immediately preceding it and is independent of all other words. • In other words, the model captures the relationship among N consecutive words.
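The assumption above can be written compactly as follows. This is the standard textbook form of the N-gram approximation, not a formula taken from the slides; the symbol w_i for the i-th word is notation introduced here for illustration.

```latex
% N-gram (Markov) assumption: only the N-1 preceding words matter.
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \dots, w_{i-1})

% Special cases commonly used in practice:
%   bigram  (N = 2):  P(w_i \mid w_{i-1})
%   trigram (N = 3):  P(w_i \mid w_{i-2}, w_{i-1})
```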
Because of computational limits in real applications, the N-gram model usually considers only a short history, giving models such as the bigram and the trigram. • The N-gram model has three main shortcomings: (1) It cannot account for all newly occurring words in the training corpus. (2) The computing cost is very high, and the available hardware resources may not be able to meet it. (3) A CCS in the N-gram model carries little semantic meaning.
The integrated method, which combines parts of the above methods, has some advantages; however, it still cannot fundamentally avoid the shortcomings of each individual part.
A new method is proposed here to overcome the shortcomings just mentioned. It automatically extracts CCSs whose support degree exceeds a predefined value and avoids miscounting shorter CCSs that are contained in longer ones. • The method is based on the idea of length descending of CCSs and needs neither learning in advance, nor a dictionary constructed by domain experts, nor a Chinese character index.
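The miscounting problem mentioned above can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names `count_in_segments` and `mask_term`, the toy text, and the use of a blank as the segmentation tag are all hypothetical choices.

```python
from collections import Counter


def count_in_segments(text, k):
    """Count every length-k character window inside blank-delimited segments."""
    counts = Counter()
    for seg in text.split():
        counts.update(seg[i:i + k] for i in range(len(seg) - k + 1))
    return counts


def mask_term(text, term, tag=" "):
    """Replace each occurrence of an extracted term by tags of the same length.

    Assumption: a blank is used as the segmentation tag.
    """
    return text.replace(term, tag * len(term))


# Toy text (hypothetical): the long string occurs twice, the short one once on its own.
text = "信息系统 信息系统 系统"

naive = count_in_segments(text, 2)            # short string counted inside the long one
masked = mask_term(text, "信息系统")           # extract the long string first
after = count_in_segments(masked, 2)          # only free-standing occurrences remain

print(naive["系统"], after["系统"])
```

Counting the short string before the long one is masked attributes the long string's occurrences to it as well; masking the long string first leaves only the occurrences where the short string stands on its own.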
3. The proposed algorithm • The Chinese language poses many complicated linguistic problems and is quite different from Western languages. The main properties of Chinese are as follows: • 1) Chinese is a language of big characters. One Chinese character occupies two bytes, whereas a Western character occupies only one. • 2) A sentence in Chinese text is a continuous string. There are no blanks inside it. • 3) Chinese can be divided into five syntactic units: morpheme, word, phrase, sentence and sentence set. • 4) Word forms in written Chinese remain unchanged on the whole.
Whether the basic processing unit in Chinese should be the word or the phrase is still controversial. • A "word" is defined as the smallest language element with semantic meaning that can be used independently. However, a single word is often too general and lacks real semantic meaning. • The phrase has a stable structure, so it should be used as the basic processing unit.
The main features of words in Chinese texts are: 1) If a continuous CCS has a high frequency, it is also likely to be a word. 2) A CCS that has a certain semantic meaning can be a word. 3) The combination patterns of Chinese characters are observable from a statistical point of view. 4) Short words with high frequency are function-oriented, whereas long words with low frequency are content-oriented.
Text processing is content-oriented, so a new Chinese text segmentation method is put forward in this paper. • The main idea is to first segment the long terms (long CCSs) on the basis of statistical analysis and then reduce the CCS length step by step. • In short, it is a maximum-frequency CCS matching method. • The merits of the proposed algorithm are: - it needs no dictionary - it needs no estimation of probabilities in advance.
4. The design of the algorithm • Theorem 1: The possibility that a CCS is a word is lower when its segmentation time is higher. • Theorem 2: The possibility that a CCS is a word is lower when the desired segmentation length of the CCS is longer. • Theorem 3: The possibility that a CCS is a word is lower when the number of Chinese characters replaced by segmentation tags is larger.
(1) (2) (3)
Based on Formulas (1), (2) and (3), , f2 is a descending function of M and . So the probability function of a co-occurring CCS is a descending function of M, and L respectively; f1 and f2 are both descending functions of L.
The pseudocode of the automatic segmentation algorithm is as follows: • k = the selected maximal length of a CCS • fp = the starting position of the CCS being processed • sl = the predefined shortest CCS length
while k > sl do
    fp = the beginning of the text;
    bp = the position of the first blank after fp;
    do
        tk = the CCS between fp and bp
        if (tk's length < k)
            start from the next CCS;
        else
            do
                tk = the CCS whose length is k starting from fp
                if tk starting from fp cannot be matched
                    extract the CCS whose length is k from the next Chinese characters;
                else
                    extract the matched CCS;
                fp = fp + 1;
        bp = the position of the first blank after fp;
    k = k - 2;
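One possible reading of the pseudocode is sketched below in Python. It is a simplified illustration, not the authors' program, and it rests on several assumptions that are not in the slides: the function name `segment_by_length_descending`, the parameter `min_support` standing in for the support-degree threshold, a blank as the segmentation tag, lengths measured in Unicode characters (so the length decreases by 1 per pass rather than by 2), and `min_len` being the shortest length that is still extracted.

```python
from collections import Counter


def segment_by_length_descending(text, max_len, min_len=2, min_support=2, tag=" "):
    """Extract frequent character strings, longest first.

    Returns a dict mapping each extracted string to its frequency.
    All parameter names are hypothetical; see the lead-in for the assumptions.
    """
    extracted = {}
    k = max_len
    while k >= min_len:
        # Count every length-k window inside each blank-delimited segment.
        counts = Counter()
        for seg in text.split():
            counts.update(seg[i:i + k] for i in range(len(seg) - k + 1))
        # Try candidates from most to least frequent.
        for cand, _ in counts.most_common():
            # Re-count, because earlier extractions in this pass may have
            # masked some occurrences of this candidate.
            freq = text.count(cand)
            if freq >= min_support:
                extracted[cand] = freq
                # Replace the extracted string by tags so that shorter
                # strings inside it are not counted again later.
                text = text.replace(cand, tag * k)
        k -= 1  # assumption: one Unicode character per step
    return extracted


sample = "信息系统 信息系统 系统 管理 管理"  # hypothetical preprocessed text
result = segment_by_length_descending(sample, max_len=4)
print(result)
```

In this toy run the four-character string is extracted in the first pass, so the two-character string inside it is left with a single free-standing occurrence and falls below the assumed support threshold.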
With regard to the time cost of this algorithm, let N be the total number of CCSs after preprocessing; the time complexity in the worst case is , but the actual time required is much less than this value. • This method does not segment single Chinese characters, regardless of their frequency, because they are useless for classification and retrieval of text in practice. The phrase is therefore used as the basic processing unit.
Since a semantic CCS carries more semantic meaning than a phrase does, it should also be used as a basic processing unit together with the phrase. • In extraction, semantic CCSs take priority over phrases.
5. Results of experiment and discussion • 5.1 Design of experiment • The experimental corpus, i.e. the applications for scientific projects, is summarized in the table below:
5.2 Results of experiment • The experiments were run under the Windows 2000 operating system on an AMD Athlon 1800++ CPU with 256 MB of memory. • In total, 4463 and 2338 CCSs are extracted from the two subjects, and the corresponding segmentation times are 128 and 32 seconds respectively. • The statistical results show that few CCSs are longer than 13, so such CCSs are grouped into the same class.
Fig. 1. Percent of different Chinese characters in all Chinese characters
The segmented CCSs account for 76.13% and 78.25% of the Chinese characters in the original texts for the subjects information and management respectively. • So most of the CCSs in the original texts can be segmented by the proposed method.
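A percentage of this kind could be computed as sketched below. The exact definition used in the experiment is not given in the slides, so this assumes coverage means the number of Chinese characters covered by extracted CCSs divided by the number of Chinese characters in the original text; the function name `coverage_percent`, the character range, and the toy data are hypothetical.

```python
import re

# Assumption: "Chinese character" means the basic CJK Unified Ideographs block.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def coverage_percent(original, extracted):
    """Share of Chinese characters in `original` covered by extracted strings.

    `extracted` maps each extracted string to its frequency.
    """
    total = len(CJK_CHAR.findall(original))
    covered = sum(len(term) * freq for term, freq in extracted.items())
    return 100.0 * covered / total if total else 0.0


# Toy data (hypothetical): 14 Chinese characters, 12 of them covered.
original = "信息系统 信息系统 系统 管理 管理"
extracted = {"信息系统": 2, "管理": 2}
pct = coverage_percent(original, extracted)
print(round(pct, 2))
```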
From Figure 1, we can see that CCSs of length 2 have the highest percentage, about 37% and 41% for the two subjects, information and management, respectively. • CCSs of length 3 account for about 18% and 13% for the two subjects respectively, whereas CCSs of length 4 account for about 24% and 27%.
The 2-length CCSs, which have the highest frequencies, are not suitable for document modeling owing to their more general meanings. • The low frequency of 3-length CCSs means that such CCSs in the selected corpus are also not suitable for document modeling. • The segmentation results obtained from the training corpus are more useful for subsequent document processing.
In addition, an experiment is conducted on single Chinese characters. A single Chinese character with high frequency has no real semantic meaning, so it is reasonable not to process it in this method. • A complete segmentation result can be obtained only from a large corpus. A CCS that has real semantic meaning but appears only once cannot be segmented in a limited corpus. In this case, it can be segmented manually.
6. Conclusion • This paper puts forward an automatic segmentation method that needs neither a dictionary nor learning in advance. • The semantic CCS is defined in this paper. Using the proposed algorithm, both semantic CCSs and phrases can be segmented. • This work is beneficial to various applications, such as automatic classification, modeling, clustering and retrieval of Chinese text.