encoding_rs is an implementation of (the non-JavaScript parts of) the Encoding Standard, written in Rust and used in Gecko (starting with Firefox 56).
Additionally, the mem module provides various operations for dealing with in-RAM text (as opposed to data that's coming from or going to an IO boundary). The mem module is a module instead of a separate crate due to internal implementation detail efficiencies.
Due to the Gecko use case, encoding_rs supports decoding to and encoding from UTF-16 in addition to supporting the usual Rust use case of decoding to and encoding from UTF-8. Additionally, the API has been designed to be FFI-friendly to accommodate the C++ side of Gecko.
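Because Gecko's UTF-16 is potentially invalid, encoding from UTF-16 has to cope with lone surrogates, which are treated as if they had been replaced with the REPLACEMENT CHARACTER (U+FFFD). The following is a minimal sketch of that replacement semantics using only the Rust standard library. It does not use or reproduce the encoding_rs API; the function name replace_lone_surrogates is hypothetical and chosen for illustration.

```rust
/// Converts potentially invalid UTF-16 to a `String`, replacing each
/// lone surrogate with U+FFFD (the REPLACEMENT CHARACTER).
/// Hypothetical helper for illustration; not part of encoding_rs.
fn replace_lone_surrogates(units: &[u16]) -> String {
    std::char::decode_utf16(units.iter().copied())
        // `decode_utf16` yields `Err` for every unpaired surrogate.
        .map(|r| r.unwrap_or(std::char::REPLACEMENT_CHARACTER))
        .collect()
}
```

With this helper, the input [0x0061, 0xD800, 0x0062] (a lone high surrogate between "a" and "b") becomes "a", U+FFFD, "b", while a valid surrogate pair is passed through as a single character.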
Specifically, encoding_rs does the following:
* [...] u16 / char16_t).
* [...] u16 / char16_t) into a sequence of bytes in an Encoding Standard-defined character encoding as if the lone surrogates had been replaced with the REPLACEMENT CHARACTER before performing the encode. (Gecko's UTF-16 is potentially invalid.)
* [...] document.characterSet.
Additionally, encoding_rs::mem does the following:
Please see the file named COPYRIGHT.
Generated API documentation is available online.
An FFI layer for encoding_rs is available as a separate crate. The crate comes with a demo C++ wrapper using the C++ standard library and GSL types.
For the Gecko context, there's a C++ wrapper using the MFBT/XPCOM types.
These bindings do not cover the mem module.
There are currently three optional cargo features:
simd-accel
Enables SSE2 acceleration on x86 and x86_64 and NEON acceleration on Aarch64. Requires nightly Rust. _Enabling this cargo feature is recommended when building for x86, x86_64 or Aarch64 on nightly Rust._ The intention is for the functionality enabled by this feature to become the normal on-by-default behavior once explicit SIMD becomes available on all Rust release channels.
Enabling this feature breaks the build unless the target is x86 with SSE2 (Rust's default 32-bit x86 target, i686, has SSE2, but Linux distros may use an x86 target without SSE2, i.e. i586 in rustup terms), x86_64 or Aarch64.
serde
Enables support for serializing and deserializing &'static Encoding-typed struct fields using Serde.
no-static-ideograph-encoder-tables
Makes the binary size smaller at the expense of ideograph encode speed for Chinese and Japanese legacy encodings. (Does not affect decode speed.)
The speed resulting from enabling this feature is believed to be acceptable for Web browser-exposed encoder use cases. However, the result is likely unacceptable for other applications that need to produce output in Chinese or Japanese legacy encodings. (But applications really should always be using UTF-8 for output.)
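As a sketch of how one of these features might be enabled from a dependent crate's Cargo.toml: the feature names come from the list above, while the version requirement shown is a placeholder assumption and should be replaced with whatever version the project actually depends on.

```toml
[dependencies]
# The version requirement below is a placeholder assumption.
# The other features listed above (simd-accel,
# no-static-ideograph-encoder-tables) would go in the same array.
encoding_rs = { version = "0.7", features = ["serde"] }
```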
For decoding to UTF-16, the goal is to perform at least as well as Gecko's old uconv. For decoding to UTF-8, the goal is to perform at least as well as rust-encoding.
Encoding to UTF-8 should be fast. (UTF-8 to UTF-8 encode should be equivalent to memcpy and UTF-16 to UTF-8 should be fast.)
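To illustrate why UTF-8 to UTF-8 encode can be equivalent to a memory copy, here is a minimal sketch using only the standard library. It is not the encoding_rs implementation, and the function name encode_utf8_to_utf8 is hypothetical. The only work beyond the copy itself is making sure that a multi-byte sequence is not split when the output buffer is shorter than the input.

```rust
/// Copies as much of `src` into `dst` as fits without splitting a
/// UTF-8 sequence. Returns the number of bytes both read and written.
/// Hypothetical helper for illustration; not part of encoding_rs.
fn encode_utf8_to_utf8(src: &str, dst: &mut [u8]) -> usize {
    let mut len = src.len().min(dst.len());
    // Back up to a character boundary so that the output stays valid UTF-8.
    // Index 0 is always a boundary, so this loop terminates.
    while !src.is_char_boundary(len) {
        len -= 1;
    }
    // Since the input is already valid UTF-8, encoding is a plain copy.
    dst[..len].copy_from_slice(&src.as_bytes()[..len]);
    len
}
```

When the whole input fits, the function degenerates to a single copy_from_slice call, which is the memcpy-like behavior referred to above.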
Speed is a non-goal when encoding to legacy encodings. Encoding to legacy encodings should not be optimized for speed at the expense of code size as long as form submission and URL parsing in Gecko don't become noticeably too slow in real-world use.
Currently, by default, encoding_rs builds with limited encoder-specific acceleration tables for GB2312 Level 1 Hanzi, Big5 Level 1 Hanzi and JIS X 0208 Level 1 Kanji. These tables use binary search and strike a balance between not having encoder-specific tables at all (doing linear search over the decode-optimized tables) and having larger directly-indexable encoder-side tables. It is not clear that anyone wants this in-between approach, and it may be changed in the future.
In the interest of binary size, Firefox builds with the no-static-ideograph-encoder-tables cargo feature, which omits the encoder-specific tables and performs linear search over the decode-optimized tables. With realistic workloads, this seemed fast enough not to be user-visibly slow on Raspberry Pi 3 (which stood in for a phone for testing) in the Web-exposed encoder use cases.
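The trade-off between the two lookup strategies can be illustrated with a minimal sketch. The tables below are tiny, hypothetical stand-ins, not the actual encoding_rs data or internals: DECODE_TABLE maps a pointer (the array index) to a code point, and ENCODE_TABLE holds the same entries sorted by code point so that it can be binary searched. The function names are likewise assumptions made for illustration.

```rust
/// Hypothetical decode-optimized table: index = pointer, value = code point.
static DECODE_TABLE: [u16; 6] = [0x4E9C, 0x5516, 0x5A03, 0x963F, 0x54C0, 0x611B];

/// Hypothetical encoder-specific table: (code point, pointer) pairs
/// sorted by code point so that binary search works.
static ENCODE_TABLE: [(u16, u16); 6] = [
    (0x4E9C, 0),
    (0x54C0, 4),
    (0x5516, 1),
    (0x5A03, 2),
    (0x611B, 5),
    (0x963F, 3),
];

/// Linear search over the decode-optimized table:
/// no extra static data, but O(n) time per lookup.
fn pointer_by_linear_search(code_point: u16) -> Option<usize> {
    DECODE_TABLE.iter().position(|&c| c == code_point)
}

/// Binary search over the encoder-specific table:
/// extra static data, but O(log n) time per lookup.
fn pointer_by_binary_search(code_point: u16) -> Option<usize> {
    ENCODE_TABLE
        .binary_search_by_key(&code_point, |&(c, _)| c)
        .ok()
        .map(|i| ENCODE_TABLE[i].1 as usize)
}
```

Both functions return the same pointer for a given code point. They differ only in the amount of static data shipped in the binary and in the cost of each lookup.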
A framework for measuring performance is available separately.
It is a goal to support the latest stable Rust, the latest nightly Rust and the version of Rust that's used for Firefox Nightly (currently 1.19.0). These are tested on Travis.
Additionally, beta and the oldest known to work Rust version (currently 1.15.0) are tested on Travis. The oldest Rust known to work is tested as a canary so that when the oldest known to work no longer works, the change can be documented here. At this time, there is no firm commitment to support a version older than what's required by Firefox, but there isn't an active plan to make changes that would make 1.15.0 no longer work, either.
A compatibility layer that implements the rust-encoding API on top of encoding_rs is provided as a separate crate (cannot be uploaded to crates.io). The compatibility layer was originally written with the assumption that Firefox would need it, but it is not currently used in Firefox.
* [...] usize instead of u8 at a time).
* [...] mem module.
* [...] mem module.
* [...] replacement a label of the replacement encoding. (Spec change.)
* [...] Encoding::for_name(). (Encoding::for_label(foo).unwrap() is now close enough after the above label change.)
* [...] parallel-utf8 cargo feature.
* [...] &'static Encoding.
* [...] Encoder::has_pending_state() public.
* [...] simd crate dependency to 0.2.0.
* [...] 7F correctly in ISO-2022-JP.
* [...] Hash for Encoding.
* [...] InputEmpty correct precedence over OutputFull when encoding with replacement and the output buffer passed in is too short or the remaining space in the output buffer is too small after a replacement.
* [...] PartialEq and Eq for the CoderResult, DecoderResult and EncoderResult types.
* [...] Encoder::encode_from_utf16. (Due to an oversight, it lacked the fix that Encoder::encode_from_utf8 already had.)
* [...] #[must_use].
* [...] parallel-utf8).
* [...] simd-accel is used.
* [...] Encoding from const to static to make the referents unique across crates that use the references.
* [...] FOO_INIT instances of Encoding to allow foreign crates to initialize static arrays with references to Encoding instances even under Rust's constraints that prohibit the initialization of &'static Encoding-typed array items with &'static Encoding-typed statics.
* [...] const to work so that cross-crate usage keeps the referents unique.
* [...] Cows from Rust-only non-streaming methods for encode and decode.
* Encoding::for_bom() returns the length of the BOM.
* [...] simd-accel feature flag. (Requires nightly Rust.)
* [...] Encoder.encode_from_utf8_to_vec_without_replacement().
* Add Encoding.is_ascii_compatible().
* Add Encoding::for_bom().
* Make == for Encoding use name comparison instead of pointer comparison, because uses of the encoding constants in different crates result in different addresses and the constants cannot be turned into statics without breaking other things.
The initial release.