A simple Chinese word segmenter
CLucene - a C++ search engine http://sourceforge.net/projects/clucene/
Traditional full-text retrieval is database-based. SQL Server, Oracle, and MySQL all provide full-text search, but they are relatively heavyweight and not well suited to stand-alone or small applications (MySQL 4.0 and above can be embedded for integrated development), and MySQL's full-text search does not support Chinese.
Later I learned that Apache has an open-source full-text search engine that is widely used: Apache Lucene, a full-text retrieval engine written in Java, with excellent performance. Unfortunately it is a Java version. I kept thinking there had to be a C or C++ version, and finally one day I dug up a good one on http://sourceforge.net: CLucene! CLucene is a C++ full-text retrieval engine, a complete port of Lucene, but it has no Chinese language support, and it has a lot of memory leaks :P
CLucene does not support Chinese word segmentation, so I wrote a simple Chinese segmenter. The idea is the traditional bigram (two-character) method. Chinese is unlike English, where a space or punctuation mark ends a word, so I split the text into overlapping two-character terms instead. For example, Beijing City (北京市) is split into 北京 and 京市. This makes the term dictionary very large, but it is a simple segmentation method (when I have time I would like to share some of my thinking on Chinese word segmentation). Of course, at search time you cannot just enter "北京市", because that will not match anything. You have to enter "北京 + 京市" to retrieve 北京市. Accuracy is not high and some words will still be missed, but it is good enough for simple segmentation.
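To make the bigram idea concrete, here is a minimal sketch in standard C++. It is not CLucene code: the `bigrams` helper is a hypothetical name, it works on a `std::u32string` rather than on CLucene's reader, and it only covers a run of Chinese characters, so boundary behaviour may differ from the tokenizer in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of CLucene): split a run of Chinese
// characters into overlapping two-character terms.
std::vector<std::u32string> bigrams(const std::u32string& text) {
    std::vector<std::u32string> terms;
    if (text.size() < 2) {
        // A lone character becomes its own term; empty input yields nothing.
        if (!text.empty()) terms.push_back(text);
        return terms;
    }
    // Each term starts one character after the previous one, so
    // neighbouring terms share a character.
    for (std::size_t i = 0; i + 1 < text.size(); ++i) {
        terms.push_back(text.substr(i, 2));
    }
    return terms;
}
```

Calling `bigrams(U"\u5317\u4eac\u5e02")` (北京市) yields the two terms 北京 and 京市. Running the query string through the same function is what makes a search for 北京市 line up with the indexed terms.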
Using CLucene's own tokenizer module as a model, I wrote a ChineseTokenizer, the module responsible for the segmentation work. Here are its main functions:
ChineseTokenizer.cpp:
Token* ChineseTokenizer::next() {
    while (!rd.Eos())
    {
        char_t ch = rd.GetNext();
        if (isSpace((char_t)ch) != 0)
        {
            continue;
        }
        // Read alphanumerics and Chinese characters
        if (isAlNum((char_t)ch) != 0)
        {
            start = rd.Column();
            return ReadChinese(ch);
        }
    }
    return NULL;
}

Token* ChineseTokenizer::ReadChinese(const char_t prev)
{
    bool isChinese = false;
    StringBuffer str;
    str.append(prev);
    char_t ch = prev;
    if (((char_t)ch >> 8) && (char_t)ch >= 0xa0)
        isChinese = true;
    while (!rd.Eos() && isSpace((char_t)ch) == 0)
    {
        ch = rd.GetNext();
        if (isAlNum((char_t)ch) != 0)
        {
            // Digits and English letters run until a space or the next Chinese character.
            // A Chinese character takes one more Chinese character to form a two-character
            // term, or stops at English text or a space.
            if (isChinese)
            {
                // prev is Chinese, and ch is Chinese too
                if (((char_t)ch >> 8) && (char_t)ch >= 0xa0)
                {
                    // Step back one Chinese character so it also starts the next term
                    str.append(ch);
                    rd.UnGet();
                    // wprintf(_T("[%s]"), str);
                    return new Token(str.getBuffer(), start, rd.Column(), tokenImage[lucene::analysis::chinese::CHINESE]);
                }
                else
                {
                    // ch is a letter, digit, or space
                    rd.UnGet();
                    // wprintf(_T("[%s]"), str);
                    return new Token(str.getBuffer(), start, rd.Column(), tokenImage[lucene::analysis::chinese::CHINESE]);
                }
            }
            else
            {
                // prev is not Chinese
                // ch is a Chinese character
                if (((char_t)ch >> 8) && (char_t)ch >= 0xa0)
                {
                    // wprintf(_T("[%s]"), str);
                    rd.UnGet();
                    return new Token(str.getBuffer(), start, rd.Column(), tokenImage[lucene::analysis::chinese::CHINESE]);
                }
                str.append(ch);
            }
        }
    }
    // wprintf(_T("[%s]"), str);
    return new Token(str.getBuffer(), start, rd.Column(), tokenImage[lucene::analysis::chinese::ALPHANUM]);
}
Note also that this segmenter does not support file streams, only in-memory input, because I use rd.UnGet(). With a file, heh, it would only step back half a character :P
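One possible way around the pushback problem, sketched here under assumptions and not taken from CLucene, is to remember the last decoded character and hand it out again on the next read, so that stepping back never depends on byte positions in the underlying source. `PushbackReader` is a hypothetical class; it borrows the method names `Eos`, `GetNext`, and `UnGet` from the `rd` object above and wraps a standard `std::wistream` instead of CLucene's reader.

```cpp
#include <cassert>
#include <istream>
#include <sstream>

// Hypothetical reader (not CLucene's): keeps the last character it returned
// so that UnGet() always restores one whole character.
class PushbackReader {
public:
    explicit PushbackReader(std::wistream& in) : in_(in) {}

    // True when nothing is pending and the stream has no more characters.
    bool Eos() {
        return !pending_ && in_.peek() == std::wistream::traits_type::eof();
    }

    // Return the pushed-back character if there is one, else read a new one.
    // Callers are expected to check Eos() first.
    wchar_t GetNext() {
        if (pending_) {
            pending_ = false;
            return last_;
        }
        last_ = static_cast<wchar_t>(in_.get());
        return last_;
    }

    // Push back the character most recently returned by GetNext().
    // Only a single level of pushback is supported.
    void UnGet() { pending_ = true; }

private:
    std::wistream& in_;
    wchar_t last_ = 0;
    bool pending_ = false;
};
```

The sketch assumes the stream already delivers whole wide characters. Decoding a multi-byte file into such a stream is a separate step that is not shown here.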
That's all for now, I was in too much of a hurry today. When I have time I will post my improvements to CLucene.
Tags: Word







