cang-jie(仓颉)


A Chinese tokenizer for tantivy, based on jieba-rs.

Currently, only UTF-8 encoded text is supported.

Example

```rust
let mut schema_builder = SchemaBuilder::default();
let text_indexing = TextFieldIndexing::default()
    .set_tokenizer(CANG_JIE) // Set custom tokenizer
    .set_index_option(IndexRecordOption::WithFreqsAndPositions);
let text_options = TextOptions::default()
    .set_indexing_options(text_indexing)
    .set_stored();
// ... Some code
let index = Index::create(RAMDirectory::create(), schema.clone())?;
let tokenizer = CangJieTokenizer {
    worker: Arc::new(Jieba::empty()), // empty dictionary
    option: TokenizerOption::Unicode,
};
index.tokenizers().register(CANG_JIE, tokenizer);
// ... Some code
```
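The example above registers a tokenizer backed by an empty dictionary. In practice you would usually load jieba-rs's bundled dictionary and pick a cut mode suited to search. A minimal sketch, assuming jieba-rs is built with its `default-dict` feature and using cang-jie's `TokenizerOption::ForSearch` variant:

```rust
use std::sync::Arc;

use cang_jie::{CangJieTokenizer, TokenizerOption};
use jieba_rs::Jieba;

fn main() {
    // Jieba::new() loads the bundled default dictionary
    // (requires jieba-rs's `default-dict` feature, on by default).
    let tokenizer = CangJieTokenizer {
        worker: Arc::new(Jieba::new()),
        // ForSearch emits overlapping sub-words, which tends to
        // improve recall; `hmm: true` enables the HMM segmenter
        // for words missing from the dictionary.
        option: TokenizerOption::ForSearch { hmm: true },
    };
    // Register it under the same name used in the schema, as above:
    // index.tokenizers().register(CANG_JIE, tokenizer);
    let _ = tokenizer;
}
```

Whether `ForSearch`, `Default`, or `Unicode` segmentation gives better results depends on the corpus; `Unicode` simply splits on character boundaries and needs no dictionary.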

Full example