tantivy-pinyin

就像名字一样,这是一个 tantivy 的拼音分析器

Just like the name says, this is a pinyin tokenizer for tantivy.

Usage (用法)

Add `tantivy_pinyin = "0.1.0"` to the `[dependencies]` section of your `Cargo.toml`.
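To get a quick feel for what the tokenizer produces, here is a minimal sketch. It assumes only the `PinyinTokenizer::token_stream` API used in the full example below; the exact pinyin strings emitted depend on the tokenizer's conversion rules.

```rust
use tantivy::tokenizer::Tokenizer;
use tantivy_pinyin::PinyinTokenizer;

fn main() {
    // Stream the pinyin tokens produced for a short Chinese string.
    let mut token_stream = PinyinTokenizer.token_stream("大多数知识");
    while token_stream.advance() {
        // Each token is expected to carry one pinyin syllable, e.g. "da", "duo", ...
        println!("{:?}", token_stream.token());
    }
}
```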

Here is a full example of using the pinyin tokenizer with pre-tokenized fields:

```rust
use tantivy::collector::{Count, TopDocs};
use tantivy::query::TermQuery;
use tantivy::schema::*;
use tantivy::tokenizer::{PreTokenizedString, Token, Tokenizer};
use tantivy::{doc, Index, ReloadPolicy};
use tempfile::TempDir;

use tantivy_pinyin::PinyinTokenizer;

fn pre_tokenize_text(text: &str) -> Vec<Token> {
    let mut token_stream = PinyinTokenizer.token_stream(text);
    let mut tokens = vec![];
    while token_stream.advance() {
        tokens.push(token_stream.token().clone());
    }
    tokens
}

pub fn main() -> tantivy::Result<()> {
    let index_path = TempDir::new()?;

    let mut schema_builder = Schema::builder();
    schema_builder.add_text_field("title", TEXT | STORED);
    schema_builder.add_text_field("body", TEXT);
    let schema = schema_builder.build();

    let index = Index::create_in_dir(&index_path, schema.clone())?;

    let mut index_writer = index.writer(50_000_000)?;

    // We can create a document manually, by setting the fields
    // one by one in a Document object.
    let title = schema.get_field("title").unwrap();
    let body = schema.get_field("body").unwrap();

    let title_text = "大多数知识,不需要我们记住";
    let body_text = "大多数知识,只需要认知即可";

    // Content of our first document.
    // We create a PreTokenizedString which contains the original text
    // and the vector of tokens.
    let title_tok = PreTokenizedString {
        text: String::from(title_text),
        tokens: pre_tokenize_text(title_text),
    };

    println!(
        "Original text: \"{}\" and tokens: {:?}",
        title_tok.text, title_tok.tokens
    );

    let body_tok = PreTokenizedString {
        text: String::from(body_text),
        tokens: pre_tokenize_text(body_text),
    };

    // Now let's create a document and add our PreTokenizedStrings.
    let old_man_doc = doc!(title => title_tok, body => body_tok);

    // ... now let's just add it to the IndexWriter.
    index_writer.add_document(old_man_doc)?;

    // Let's commit changes.
    index_writer.commit()?;

    // ... and now is the time to query our index.
    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::OnCommit)
        .try_into()?;

    let searcher = reader.searcher();

    // We want documents containing the pinyin token "xin", so we use a TermQuery.
    // Using PreTokenizedString means the tokens are stored as is, avoiding stemming
    // and lowercasing, which preserves full words in their original form.
    let query = TermQuery::new(
        // Term::from_field_text(title, "liu"),
        Term::from_field_text(body, "xin"),
        IndexRecordOption::Basic,
    );

    let (top_docs, count) = searcher.search(&query, &(TopDocs::with_limit(2), Count))?;

    println!("Found {} documents", count);

    // Now let's print out the results.
    // Note that the tokens are not stored along with the original text
    // in the document store.
    for (_score, doc_address) in top_docs {
        let retrieved_doc = searcher.doc(doc_address)?;
        println!("Document: {}", schema.to_json(&retrieved_doc));
    }

    Ok(())
}
```
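The example above pre-tokenizes text by hand. Since `PinyinTokenizer` appears to implement tantivy's `Tokenizer` trait (as the example suggests), you can presumably also register it on the index's tokenizer manager so ordinary text fields are analyzed into pinyin automatically. The sketch below is an unverified illustration of that approach; the field name and the `"pinyin"` tokenizer name are arbitrary choices.

```rust
use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};
use tantivy::Index;
use tantivy_pinyin::PinyinTokenizer;

fn build_index() -> tantivy::Result<Index> {
    // Declare a text field whose indexing uses a tokenizer registered as "pinyin".
    let indexing = TextFieldIndexing::default()
        .set_tokenizer("pinyin")
        .set_index_option(IndexRecordOption::WithFreqsAndPositions);
    let options = TextOptions::default()
        .set_indexing_options(indexing)
        .set_stored();

    let mut schema_builder = Schema::builder();
    schema_builder.add_text_field("body", options);
    let schema = schema_builder.build();

    // Create an in-memory index and register the pinyin tokenizer under that name,
    // so Chinese text is turned into pinyin tokens at indexing and query time.
    let index = Index::create_in_ram(schema);
    index.tokenizers().register("pinyin", PinyinTokenizer);
    Ok(index)
}
```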

Features

`stop_words`: Chinese stop words (中文停用词)
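The exact API exposed by the `stop_words` feature is not documented here. As a rough, hypothetical illustration, a Chinese stop-word list could be combined with the tokenizer through tantivy's own `StopWordFilter`; the word list below is a made-up placeholder.

```rust
use tantivy::tokenizer::{StopWordFilter, TextAnalyzer};
use tantivy_pinyin::PinyinTokenizer;

fn pinyin_analyzer_with_stop_words() -> TextAnalyzer {
    // Placeholder list; a real list would come from the crate's `stop_words` feature.
    let stop_words = vec!["de".to_string(), "le".to_string()];
    // Drop stop-word tokens after pinyin conversion.
    TextAnalyzer::from(PinyinTokenizer).filter(StopWordFilter::remove(stop_words))
}
```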

Test

`cargo test`

Postscript (附言)

项目比较小,如果帮助到了你,给个 star 鼓励一下作者吧

This is a small project. If it has helped you, please give it a star to encourage the author.