encoding_rs

encoding_rs is an implementation of the (non-JavaScript parts of) the Encoding Standard written in Rust and used in Gecko (starting with Firefox 56).

Additionally, the mem module provides various operations for dealing with in-RAM text (as opposed to data that's coming from or going to an IO boundary). The mem module is a module rather than a separate crate because sharing internal implementation details with the rest of the crate is more efficient.

Functionality

Due to the Gecko use case, encoding_rs supports decoding to and encoding from UTF-16 in addition to supporting the usual Rust use case of decoding to and encoding from UTF-8. Additionally, the API has been designed to be FFI-friendly to accommodate the C++ side of Gecko.

Specifically, encoding_rs does the following:

Decodes a stream of bytes in an Encoding Standard-defined character encoding into valid aligned native-endian in-RAM UTF-16 (units of u16 / char16_t).
Encodes a stream of potentially-invalid aligned native-endian in-RAM UTF-16 (units of u16 / char16_t) into a sequence of bytes in an Encoding Standard-defined character encoding as if the lone surrogates had been replaced with the REPLACEMENT CHARACTER before performing the encode. (Gecko's UTF-16 is potentially invalid.)
Decodes a stream of bytes in an Encoding Standard-defined character encoding into valid UTF-8.
Encodes a stream of valid UTF-8 into a sequence of bytes in an Encoding Standard-defined character encoding. (Rust's UTF-8 is guaranteed-valid.)
Does the above in streaming (input and output split across multiple buffers) and non-streaming (whole input in a single buffer and whole output in a single buffer) variants.
Avoids copying (borrows) when possible in the non-streaming cases when decoding to or encoding from UTF-8.
Resolves textual labels that identify character encodings in protocol text into type-safe objects representing those encodings conceptually.
Maps the type-safe encoding objects onto strings suitable for returning from document.characterSet.
Validates UTF-8 (in common instruction set scenarios a bit faster for Web workloads than the standard library; hopefully will get upstreamed some day) and ASCII.
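
Two of the behaviors listed above can be illustrated with the standard library alone: borrowing instead of copying when the input is already valid UTF-8, and treating lone surrogates in potentially-invalid UTF-16 as if they were the REPLACEMENT CHARACTER. The following is a minimal sketch of those semantics, not encoding_rs's implementation or API; the function names `decode_utf8_borrowing` and `replace_lone_surrogates` are hypothetical and exist only for illustration.

```rust
use std::borrow::Cow;

/// Hypothetical helper: decodes bytes as UTF-8, borrowing the input when
/// it is already valid and allocating only when malformed sequences have
/// to be replaced with U+FFFD.
fn decode_utf8_borrowing(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

/// Hypothetical helper: converts potentially-invalid UTF-16 to a String,
/// substituting U+FFFD for every lone surrogate, as an encoder would
/// conceptually do before encoding.
fn replace_lone_surrogates(units: &[u16]) -> String {
    char::decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}
```

Valid input such as `b"abc"` comes back as `Cow::Borrowed`, so no copy is made; only input that needs repair produces `Cow::Owned`.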

Additionally, encoding_rs::mem does the following:

Checks if a byte buffer contains only ASCII.
Checks if a potentially-invalid UTF-16 buffer contains only Basic Latin (ASCII).
Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16 buffer contains only Latin1 code points (below U+0100).
Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16 buffer or a code point or a UTF-16 code unit can trigger right-to-left behavior (suitable for checking if the Unicode Bidirectional Algorithm can be optimized out).
Combined versions of the above two checks.
Converts valid UTF-8, potentially-invalid UTF-8 and Latin1 to UTF-16.
Converts potentially-invalid UTF-16 and Latin1 to UTF-8.
Converts UTF-8 and UTF-16 to Latin1 (if in range).
Finds the first invalid code unit in a buffer of potentially-invalid UTF-16.
Makes a mutable buffer of potentially-invalid UTF-16 contain valid UTF-16.
Copies ASCII from one buffer to another up to the first non-ASCII byte.
Converts ASCII to UTF-16 up to the first non-ASCII byte.
Converts UTF-16 to ASCII up to the first non-Basic Latin code unit.
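
To make the semantics of a few of these operations concrete, here is a minimal standard-library-only sketch. It is not the crate's code (which is optimized and may use SIMD); the `_sketch` function names are hypothetical and chosen only to echo the operations listed above.

```rust
/// Returns the index of the first unpaired surrogate in the buffer, or
/// the buffer length if the whole buffer is valid UTF-16.
fn utf16_valid_up_to_sketch(buffer: &[u16]) -> usize {
    let mut i = 0;
    while i < buffer.len() {
        match buffer[i] {
            // High surrogate: valid only when a low surrogate follows.
            0xD800..=0xDBFF => match buffer.get(i + 1).copied() {
                Some(0xDC00..=0xDFFF) => i += 2,
                _ => return i,
            },
            // Low surrogate without a preceding high surrogate.
            0xDC00..=0xDFFF => return i,
            _ => i += 1,
        }
    }
    buffer.len()
}

/// Overwrites every unpaired surrogate with U+FFFD in place.
fn ensure_utf16_validity_sketch(buffer: &mut [u16]) {
    let mut start = 0;
    loop {
        start += utf16_valid_up_to_sketch(&buffer[start..]);
        if start == buffer.len() {
            return;
        }
        buffer[start] = 0xFFFD;
        start += 1;
    }
}

/// Converts Latin1 to UTF-8: each byte maps to the code point with the
/// same numeric value (U+0000 through U+00FF).
fn latin1_to_utf8_sketch(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

/// Checks that a valid UTF-8 string contains only code points below U+0100.
fn is_str_latin1_sketch(s: &str) -> bool {
    s.chars().all(|c| (c as u32) < 0x100)
}
```

For example, `utf16_valid_up_to_sketch(&[0x0061, 0xDC00, 0x0062])` stops at index 1, the lone low surrogate.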

Integration with `std::io`

Notably, the above feature list doesn't include the capability to wrap a std::io::Read, decode it into UTF-8 and present the result via std::io::Read. The encoding_rs_io crate provides that capability.

Decoding Email

For decoding character encodings that occur in email, use the charset crate instead of using this one directly. (It wraps this crate and adds UTF-7 decoding.)

Licensing

Please see the file named COPYRIGHT.

API Documentation

Generated API documentation is available online.

C and C++ bindings

An FFI layer for encoding_rs is available as a separate crate. The crate comes with a demo C++ wrapper using the C++ standard library and GSL types.

For the Gecko context, there's a C++ wrapper using the MFBT/XPCOM types.

These bindings do not cover the mem module.

Sample programs

Optional features

There are currently these optional cargo features:

`simd-accel`

Enables SSE2 acceleration on x86 and x86_64 and NEON acceleration on Aarch64 and ARMv7. _Enabling this cargo feature is recommended when building for x86, x86_64, ARMv7 or Aarch64._ The intention is for the functionality enabled by this feature to become the normal on-by-default behavior once portable SIMD becomes part of stable Rust.

Enabling this feature breaks the build unless the target is x86 with SSE2 (Rust's default 32-bit x86 target, i686, has SSE2, but Linux distros may use an x86 target without SSE2, i.e. i586 in rustup terms), ARMv7 or thumbv7 with NEON (-C target_feature=+neon), x86_64 or Aarch64.

Used by Firefox.
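
The target requirements described above for `simd-accel` can be expressed as a compile-time check. This is only a sketch under the assumption that the conditions map directly onto Rust's `target_arch` and `target_feature` cfg values (with ARMv7 and thumbv7 both reporting `target_arch = "arm"`); the function name `target_supports_simd_accel` is hypothetical and is not part of encoding_rs.

```rust
/// Hypothetical helper: returns true when the compilation target matches
/// the conditions listed for the `simd-accel` feature.
fn target_supports_simd_accel() -> bool {
    cfg!(any(
        target_arch = "x86_64",
        target_arch = "aarch64",
        // 32-bit x86 qualifies only when SSE2 is enabled (e.g. i686).
        all(target_arch = "x86", target_feature = "sse2"),
        // ARMv7 / thumbv7 qualify only with -C target_feature=+neon.
        all(target_arch = "arm", target_feature = "neon")
    ))
}
```

A build script or a test in a downstream project could use a check like this to decide whether requesting the feature is safe for the current target.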

`serde`

Enables support for serializing and deserializing &'static Encoding-typed struct fields using Serde.

Not used by Firefox.

`fast-legacy-encode`

A catch-all option for enabling the fastest legacy encode options. Does not affect decode speed or UTF-8 encode speed.

At present, this option is equivalent to enabling the following options:

* fast-hangul-encode
* fast-hanja-encode
* fast-kanji-encode
* fast-gb-hanzi-encode
* fast-big5-hanzi-encode

Adds 176 KB to the binary size.

Not used by Firefox.

`fast-hangul-encode`

Changes the encoding of precomposed Hangul syllables into EUC-KR from a binary search over the decode-optimized tables to a lookup by index, making Korean plain-text encode about 4 times as fast as without this option.

Adds 20 KB to the binary size.
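
The trade-off between the two lookup strategies can be sketched with a toy table. The data below is a four-entry illustration, not the crate's actual EUC-KR tables, and all names (`DECODE`, `ENCODE`, `UNMAPPED`, `encode_by_binary_search`, `encode_by_index`) are hypothetical. The point is that a sorted decode table can be reused for encoding via binary search at no extra size cost, while a dedicated encode table answers in a single indexed read but has to be stored in the binary.

```rust
/// Toy decode-optimized table: pointer (index) -> UTF-16 code unit,
/// sorted in ascending order.
const DECODE: [u16; 4] = [0xAC00, 0xAC01, 0xAC04, 0xAC07];

/// Marker for code units that have no pointer in the toy table.
const UNMAPPED: u8 = u8::MAX;

/// Toy encode-optimized table: (code unit - 0xAC00) -> pointer.
/// It is larger than DECODE because it also stores the gaps.
const ENCODE: [u8; 8] = [0, 1, UNMAPPED, UNMAPPED, 2, UNMAPPED, UNMAPPED, 3];

/// Encode without an extra table: binary search over the decode table.
fn encode_by_binary_search(unit: u16) -> Option<usize> {
    DECODE.binary_search(&unit).ok()
}

/// Encode with the extra table: one subtraction and one indexed read.
fn encode_by_index(unit: u16) -> Option<usize> {
    let offset = usize::from(unit.checked_sub(0xAC00)?);
    match ENCODE.get(offset).copied() {
        Some(UNMAPPED) | None => None,
        Some(pointer) => Some(usize::from(pointer)),
    }
}
```

Both functions return the same pointer for every input; they differ only in the work done per lookup and in how much table data has to ship.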