Rust crate to convert raw text bytes into a valid `std::string::String`
with plain ASCII encoding
(... `unwrap()` methods)\
Developed with the DevOps mentality: "can fail but will live to tell"\
Most Rust parsing libraries will bail out when fed raw data that is not UTF-8 encoded, such as ISO-8859-15 or Windows encodings, or data with mixed-up encodings. \
Using `String::from_utf8_lossy()` breaks such data, because every invalid byte sequence is replaced with the replacement character U+FFFD, and it involves linear back-and-forth parsing at byte level, which introduces a performance penalty on larger data.\
`text-sanitizer` does not depend on proper encoding detection and relies only on an internal, customizable conversion map.
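To illustrate the idea of a conversion map, here is a minimal sketch using only the standard library. It is not the crate's actual implementation: the function `sanitize_with_map`, the helper `sparkle_heart_map` and the `HashMap<Vec<u8>, &str>` layout are assumptions made for illustration. The `(?fa)` placeholder format for unknown bytes simply mimics the output shown in this README's examples.

```rust
use std::collections::HashMap;

/// Hypothetical example map: known byte sequences and their ASCII replacement.
fn sparkle_heart_map() -> HashMap<Vec<u8>, &'static str> {
    let mut map = HashMap::new();
    map.insert(vec![240, 159, 146, 150], "<3"); // Sparkle Heart (U+1F496)
    map
}

/// Hypothetical map-based sanitizer (NOT the crate's real code).
/// ASCII bytes pass through, mapped sequences are replaced,
/// and any other byte is written as a hex placeholder like "(?fa)".
fn sanitize_with_map(raw: &[u8], map: &HashMap<Vec<u8>, &str>) -> String {
    let mut out = String::with_capacity(raw.len());
    let mut i = 0;

    while i < raw.len() {
        let b = raw[i];

        if b.is_ascii() {
            // Plain ASCII is already valid and is copied unchanged.
            out.push(b as char);
            i += 1;
            continue;
        }

        // Try the longest candidate first (a UTF-8 sequence has at most 4 bytes).
        let max_len = (raw.len() - i).min(4);
        let hit = (1..=max_len)
            .rev()
            .find_map(|n| map.get(&raw[i..i + n]).map(|s| (n, *s)));

        match hit {
            Some((n, replacement)) => {
                out.push_str(replacement);
                i += n;
            }
            None => {
                // Unknown byte: keep it visible instead of failing.
                out.push_str(&format!("(?{:02x})", b));
                i += 1;
            }
        }
    }

    out
}
```

With the example map, the bytes `[240, 159, 146, 150, 119, 250, 240, 159, 146, 150]` come out as `<3w(?fa)<3`. No encoding detection is needed because the function never decodes the input; it only looks up byte sequences.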
The `sanitizer::sanitize_u8()` function takes the raw data and creates a new valid UTF-8 `std::string::String` from it.
```rust
use text_sanitizer::sanitizer;

fn sparkle_heart() {
    //-------------------------------------
    // Test data is the Sparkle Heart from the UTF-8 documentation examples
    // which will be converted to "<3".

    let vsparkle_heart = vec![240, 159, 146, 150];
    let vrqlngs: Vec<String> = vec![String::from("en")];

    let srsout = sanitizer::sanitize_u8(&vsparkle_heart, &vrqlngs, &"");

    println!("sparkle_heart: '{}'", srsout);

    assert_eq!(srsout, "<3");
}
```
Consider this example, where the data in the center is corrupted.
Normal _Rust_ libraries cannot parse this data, and the valid information it contains would be lost:
```rust
use text_sanitizer::sanitizer;

fn two_hearts_center() {
    //-------------------------------------
    // Test data contains 2 Sparkle Hearts but is corrupted in the center
    // According to the Official Standard Library Documentation at:
    // https://doc.rust-lang.org/std/string/struct.String.html#method.from_utf8
    // this would produce a FromUtf8Error or panic the application
    // when used with unwrap()

    let vsparkle_heart = vec![240, 159, 146, 150, 119, 250, 240, 159, 146, 150];
    let vrqlngs: Vec<String> = vec![String::from("en")];

    let srsout = sanitizer::sanitize_u8(&vsparkle_heart, &vrqlngs, &" -d");

    println!("sparkle_heart: '{}'", srsout);

    assert_eq!(srsout, "<3w(?fa)<3");
}
```
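For comparison, the following sketch shows how the standard library alone treats the same corrupted bytes. It uses only `std::str::from_utf8` and `String::from_utf8_lossy`; the helper names `first_invalid_byte` and `lossy` are made up for this illustration and are not part of `text-sanitizer`.

```rust
/// Strict check: index of the first invalid byte, or `None` for valid UTF-8.
/// A strict conversion rejects the whole buffer at this position.
fn first_invalid_byte(raw: &[u8]) -> Option<usize> {
    std::str::from_utf8(raw).err().map(|e| e.valid_up_to())
}

/// Lossy conversion: each invalid sequence becomes U+FFFD, so the
/// original byte value cannot be recovered from the result.
fn lossy(raw: &[u8]) -> String {
    String::from_utf8_lossy(raw).into_owned()
}
```

For `[240, 159, 146, 150, 119, 250, 240, 159, 146, 150]`, `first_invalid_byte` returns `Some(5)`, the position of byte 250. `lossy` keeps both hearts and the `w` but turns byte 250 into a single U+FFFD, whereas the sanitized output above still shows the byte value as `(?fa)`.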