A simple Chinese word segmenter

  CLucene - a C++ search engine: http://sourceforge.net/projects/clucene/

  Traditional full-text retrieval is database-based: SQL Server, Oracle, and MySQL all provide full-text search. But these are relatively heavyweight and not suitable for stand-alone or small applications (MySQL 4.0 and above can be embedded for integrated development), and MySQL does not support Chinese.
  Later I learned that Apache has an open-source full-text search engine that is widely used: Apache Lucene, a full-text retrieval engine written in Java with excellent performance. Unfortunately it is Java only, and I kept thinking there must be a C or C++ version. Finally, one day I dug up a gem on http://sourceforge.net: CLucene! CLucene is a C++ full-text retrieval engine, a complete port of Lucene, but it has no Chinese support and quite a few memory leaks :P
  CLucene does not support Chinese word segmentation, so I wrote a simple Chinese segmenter. The idea is roughly the traditional bigram (two-character) method. Chinese, unlike English, cannot treat a space or punctuation mark as the end of a word, so I use bigram segmentation. For example, bigram segmentation splits 北京市 (Beijing City) into 北京 and 京市. This makes the term dictionary very large, but it is a simple segmentation method (later I would like to share some of my thoughts on Chinese word segmentation). Of course, entering "北京市" as the query will not retrieve anything; you have to enter the bigrams, as in "+北京 +京市", and then 北京市 can be retrieved. Accuracy is not high and some words will be missed, but it is good enough for simple segmentation.
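
  As a rough illustration of the bigram idea described above, here is a minimal sketch. It is not CLucene code: the helper name `bigrams` is made up for this example, and it assumes the text has already been decoded into UTF-32 code points and contains only Chinese characters.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of CLucene: split a run of Chinese
// characters into overlapping two-character terms.
// Assumes the input is already decoded into UTF-32 code points.
std::vector<std::u32string> bigrams(const std::u32string& run)
{
    std::vector<std::u32string> terms;

    // A run shorter than two characters cannot form a bigram;
    // keep a lone character as its own term.
    if (run.size() < 2)
    {
        if (!run.empty())
            terms.push_back(run);
        return terms;
    }

    // Slide a two-character window over the run, one character at a time.
    for (std::size_t i = 0; i + 1 < run.size(); ++i)
        terms.push_back(run.substr(i, 2));

    return terms;
}
```

  Calling `bigrams` on the three characters of 北京市 yields the two terms 北京 and 京市. An n-character run yields n - 1 terms, which is why the term dictionary grows quickly.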
  Using CLucene's own tokenizer module as a model, I wrote a ChineseTokenizer, which is responsible for the segmentation work. Here are its main functions:

ChineseTokenizer.cpp:

  Token* ChineseTokenizer::next()
  {
      while (!rd.Eos())
      {
          char_t ch = rd.GetNext();

          // Skip whitespace between tokens
          if (isSpace((char_t)ch) != 0)
          {
              continue;
          }
          // Read alphanumerics and Chinese characters
          if (isAlNum((char_t)ch) != 0)
          {
              start = rd.Column();

              return ReadChinese(ch);
          }
      }
      return NULL;
  }

  Token* ChineseTokenizer::ReadChinese(const char_t prev)
  {
      bool isChinese = false;
      StringBuffer str;
      str.append(prev);

      char_t ch = prev;

      // Treat any character with a non-zero high byte as Chinese
      if (((char_t)ch >> 8) && (char_t)ch >= 0xa0)
          isChinese = true;

      while (!rd.Eos() && isSpace((char_t)ch) == 0)
      {
          ch = rd.GetNext();

          if (isAlNum((char_t)ch) != 0)
          {
              // Letters and digits: read until a space or until the next character is Chinese.
              // Chinese: read a two-character phrase, or stop at a letter, digit, or space.
              if (isChinese)
              {
                  // prev is Chinese and ch is Chinese too
                  if (((char_t)ch >> 8) && (char_t)ch >= 0xa0)
                  {
                      // Return a two-character Chinese token; push ch back
                      // so that it also starts the next token
                      str.append(ch);
                      rd.UnGet();
                      // wprintf(_T("[%s]"), str);
                      return new Token(str.getBuffer(), start, rd.Column(), tokenImage[lucene::analysis::chinese::CHINESE]);
                  }
                  else
                  {
                      // ch is a letter or digit: push it back and return the single Chinese character
                      rd.UnGet();
                      // wprintf(_T("[%s]"), str);
                      return new Token(str.getBuffer(), start, rd.Column(), tokenImage[lucene::analysis::chinese::CHINESE]);
                  }
              }
              else
              {
                  // prev is not Chinese
                  // ch is a Chinese character: push it back and end the current token
                  if (((char_t)ch >> 8) && (char_t)ch >= 0xa0)
                  {
                      // wprintf(_T("[%s]"), str);
                      rd.UnGet();
                      return new Token(str.getBuffer(), start, rd.Column(), tokenImage[lucene::analysis::chinese::CHINESE]);
                  }
                  str.append(ch);
              }
          }

      }
      // wprintf(_T("[%s]"), str);
      return new Token(str.getBuffer(), start, rd.Column(), tokenImage[lucene::analysis::chinese::ALPHANUM]);
  }
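
  The listing repeats the same "is this Chinese?" test three times. For non-negative values that test is true for any character above 0xFF, so it also accepts non-Chinese wide characters. The sketch below is one possible tighter check, not the author's code: both function names are hypothetical, and it assumes characters are Unicode code points, which the original `char_t` may not be.

```cpp
// Hypothetical helpers, not part of CLucene.

// The test used in the listing above, written once: a non-zero high byte
// and a value of at least 0xa0. For non-negative values this is the same
// as ch > 0xFF.
bool passesOriginalTest(char32_t ch)
{
    return (ch >> 8) != 0 && ch >= 0xa0;
}

// A narrower alternative, assuming Unicode code points: accept only the
// basic CJK Unified Ideographs block, U+4E00 through U+9FFF.
bool isCjkIdeograph(char32_t ch)
{
    return ch >= 0x4E00 && ch <= 0x9FFF;
}
```

  With the narrower check, a character such as U+0100 (a Latin letter with a macron) passes the original test but is not treated as Chinese. Whether that matters depends on the text being indexed.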

  Note also that this segmenter does not support file streams, only in-memory input, because I use rd.UnGet(). With a file, heh, it only steps back half a character (one byte) :P

  Well, that's all for now; I'm in too much of a rush today. When I have time I will post my improvements to CLucene.
