Automated Tests

Text-Sanitizer

Rust Crate to convert raw text bytes into valid std::str::String with plain ASCII encoding

Features

Motivation

Most Rust parsing libraries will bail out when fed with raw data that is not UTF-8 encoded like ISO-8859-15 Windows encoding and others or mixed-up encodings. \ Using Str::from_utf8_lossy() will break those data and includes linear back and forth parsing on byte level which introduces performance penality on bigger data.\ text-sanitizer does not depend on proper encoding detection and relies only on an internal customizable convertion map.

Usage

The sanitizer::sanitize_u8() function takes the raw data and creates a new valid UTF-8 std::str::String from it. ```rust

fn sparkle_heart() { //------------------------------------- // Test data is the Sparkle Heart from the UTF-8 documentation examples // which will be converted to " <3 ".

let vsparkle_heart = vec![240, 159, 146, 150];
let vrqlngs: Vec<String> = vec![String::from("en")];

let srsout = sanitizer::sanitize_u8(&vsparkle_heart, &vrqlngs, &"");

println!("sparkle_heart: '{}'", srsout);

assert_eq!(srsout, "<3");

} Considering this example where the data in the center is corrupted somehow: This data cannot be parsed by normal _Rust_ libraries and the containing valid information would be lost. rust use text_sanitizer::sanitizer;

fn twoheartscenter() { //------------------------------------- // Test data contains 2 Sparkle Hearts but is corrupted in the center // According to the Official Standard Library Documentation at: // https://doc.rust-lang.org/std/string/struct.String.html#method.from_utf8 // this would produce a FromUtf8Error or panic the application // when used with unwrap()

let vsparkle_heart = vec![240, 159, 146, 150, 119, 250, 240, 159, 146, 150];
let vrqlngs: Vec<String> = vec![String::from("en")];

let srsout = sanitizer::sanitize_u8(&vsparkle_heart, &vrqlngs, &" -d");

println!("sparkle_heart: '{}'", srsout);

assert_eq!(srsout, "<3w(?fa)<3");

} ```