As the name suggests, this is a pinyin tokenizer for tantivy.
## Add dependencies

```toml
tantivy_pinyin = "0.1.0"
```
## Example

Here is an example of using the pinyin tokenizer:
```rust
use tantivy::collector::{Count, TopDocs};
use tantivy::query::TermQuery;
use tantivy::schema::*;
use tantivy::tokenizer::{PreTokenizedString, Token, TokenStream, Tokenizer};
use tantivy::{doc, Index, ReloadPolicy};
use tempfile::TempDir;

use tantivy_pinyin::PinyinTokenizer;

fn pre_tokenize_text(text: &str) -> Vec<Token> {
    let mut token_stream = PinyinTokenizer.token_stream(text);
    let mut tokens = vec![];
    while token_stream.advance() {
        tokens.push(token_stream.token().clone());
    }
    tokens
}

pub fn main() -> tantivy::Result<()> {
    let index_path = TempDir::new()?;

    let mut schema_builder = Schema::builder();

    schema_builder.add_text_field("title", TEXT | STORED);
    schema_builder.add_text_field("body", TEXT);

    let schema = schema_builder.build();

    let index = Index::create_in_dir(&index_path, schema.clone())?;

    let mut index_writer = index.writer(50_000_000)?;

    // We can create a document manually, by setting the fields
    // one by one in a Document object.
    let title = schema.get_field("title").unwrap();
    let body = schema.get_field("body").unwrap();

    let title_text = "大多数知识,不需要我们记住";
    let body_text = "大多数知识,只需要认知即可";

    // Content of our first document.
    // We create a PreTokenizedString, which contains the original text
    // and a vector of tokens.
    let title_tok = PreTokenizedString {
        text: String::from(title_text),
        tokens: pre_tokenize_text(title_text),
    };

    println!(
        "Original text: \"{}\" and tokens: {:?}",
        title_tok.text, title_tok.tokens
    );

    let body_tok = PreTokenizedString {
        text: String::from(body_text),
        tokens: pre_tokenize_text(body_text),
    };

    // Now let's create a document and add our PreTokenizedString.
    let old_man_doc = doc!(title => title_tok, body => body_tok);

    // ... now let's just add it to the IndexWriter.
    index_writer.add_document(old_man_doc)?;

    // Let's commit changes.
    index_writer.commit()?;

    // ... and now is the time to query our index.
    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::OnCommit)
        .try_into()?;

    let searcher = reader.searcher();

    // We want to get documents containing the token "xin"; we will use a TermQuery to do it.
    // Using PreTokenizedString means the tokens are stored as-is, avoiding stemming
    // and lowercasing, which preserves full words in their original form.
    let query = TermQuery::new(
        //Term::from_field_text(title, "liu"),
        Term::from_field_text(body, "xin"),
        IndexRecordOption::Basic,
    );

    let (top_docs, count) = searcher.search(&query, &(TopDocs::with_limit(2), Count))?;

    println!("Found {} documents", count);

    // Now let's print out the results.
    // Note that the tokens are not stored along with the original text
    // in the document store.
    for (_score, doc_address) in top_docs {
        let retrieved_doc = searcher.doc(doc_address)?;
        println!("Document: {}", schema.to_json(&retrieved_doc));
    }

    Ok(())
}
```
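Instead of pre-tokenizing documents yourself, you can also register the tokenizer with the index and let tantivy run it at indexing time. The sketch below is an assumption-based illustration, not part of this crate's documented API: it assumes `PinyinTokenizer` implements tantivy's `Tokenizer` trait (as the example above suggests), and the tokenizer name `"pinyin"` is an arbitrary choice.

```rust
use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};
use tantivy::Index;
use tantivy_pinyin::PinyinTokenizer;

fn build_index() -> tantivy::Result<Index> {
    let mut schema_builder = Schema::builder();

    // Configure the field to use the tokenizer registered under the
    // (hypothetical) name "pinyin".
    let text_options = TextOptions::default()
        .set_indexing_options(
            TextFieldIndexing::default()
                .set_tokenizer("pinyin")
                .set_index_option(IndexRecordOption::WithFreqsAndPositions),
        )
        .set_stored();
    schema_builder.add_text_field("title", text_options);
    let schema = schema_builder.build();

    let index = Index::create_in_ram(schema);

    // Register the pinyin tokenizer under the name referenced by the schema,
    // assuming PinyinTokenizer implements tantivy's Tokenizer trait.
    index.tokenizers().register("pinyin", PinyinTokenizer);
    Ok(index)
}
```

With this setup, plain `&str` field values are tokenized into pinyin automatically when documents are added, so `PreTokenizedString` is no longer needed.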
## stop_words

`stop_words`: a list of Chinese stop words.
## Test

```shell
cargo test
```
This is a small project; if it has helped you, please give it a star to encourage the author.