encoding_rs is an implementation of the (non-JavaScript parts of) the Encoding Standard written in Rust and used in Gecko (starting with Firefox 56).
Additionally, the mem module provides various operations for dealing with in-RAM text (as opposed to data that's coming from or going to an IO boundary). The mem module is a module instead of a separate crate due to internal implementation detail efficiencies.
Due to the Gecko use case, encoding_rs supports decoding to and encoding from UTF-16 in addition to supporting the usual Rust use case of decoding to and encoding from UTF-8. Additionally, the API has been designed to be FFI-friendly to accommodate the C++ side of Gecko.
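The difference between the two in-memory forms can be illustrated with the Rust standard library alone. The following is a minimal sketch, not the encoding_rs API: the function names are hypothetical. It shows that UTF-16 may contain unpaired surrogates, which have to be replaced with U+FFFD on the way to UTF-8, while Rust's UTF-8 is always valid, so the opposite direction cannot fail.

```rust
use std::char::{decode_utf16, REPLACEMENT_CHARACTER};

/// Hypothetical helper: converts potentially invalid UTF-16 into
/// guaranteed-valid UTF-8, replacing each unpaired surrogate with U+FFFD.
pub fn utf16_to_utf8_lossy(units: &[u16]) -> String {
    decode_utf16(units.iter().copied())
        .map(|result| result.unwrap_or(REPLACEMENT_CHARACTER))
        .collect()
}

/// Hypothetical helper: converts valid UTF-8 into UTF-16 code units.
/// A Rust `&str` is always valid, so no error handling is needed.
pub fn utf8_to_utf16(text: &str) -> Vec<u16> {
    text.encode_utf16().collect()
}
```

Calling utf16_to_utf8_lossy on the units 0x0061, 0xD800, 0x0062 yields the letter a, U+FFFD and the letter b, because 0xD800 is a high surrogate with no low surrogate after it.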
Specifically, encoding_rs does the following:
[...] u16 / char16_t).
[...] u16 / char16_t) into a sequence of bytes in an Encoding Standard-defined character encoding as if the lone surrogates had been replaced with the REPLACEMENT CHARACTER before performing the encode. (Gecko's UTF-16 is potentially invalid.)
[...] document.characterSet.
Additionally, encoding_rs::mem does the following:
std::io
Notably, the above feature list doesn't include the capability to wrap a std::io::Read, decode it into UTF-8 and present the result via std::io::Read. The encoding_rs_io crate provides that capability.
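To make the shape of such a wrapper concrete, here is a minimal std-only sketch. It is not the encoding_rs_io API: the type Latin1ToUtf8Reader is hypothetical, and it handles only strict ISO-8859-1 (each byte value equals the code point), which is simpler than any encoding defined in the Encoding Standard. It wraps a std::io::Read, decodes to UTF-8 and presents the result via std::io::Read.

```rust
use std::io::{self, Read};

/// Hypothetical adapter: reads ISO-8859-1 bytes from `inner` and exposes
/// the same text as UTF-8 bytes through `std::io::Read`.
pub struct Latin1ToUtf8Reader<R: Read> {
    inner: R,
    /// Second byte of a two-byte UTF-8 sequence that did not fit into the
    /// caller's buffer on the previous call.
    pending: Option<u8>,
}

impl<R: Read> Latin1ToUtf8Reader<R> {
    pub fn new(inner: R) -> Self {
        Latin1ToUtf8Reader { inner, pending: None }
    }
}

impl<R: Read> Read for Latin1ToUtf8Reader<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        if out.is_empty() {
            return Ok(0);
        }
        // Flush the leftover byte first; a short read is allowed by `Read`.
        if let Some(byte) = self.pending.take() {
            out[0] = byte;
            return Ok(1);
        }
        // Each input byte becomes at most two output bytes, so reading at
        // most half of the output length (rounded up) leaves at most one
        // byte that has to wait in `pending`.
        let mut src = [0u8; 1024];
        let max = ((out.len() + 1) / 2).min(src.len());
        let read = self.inner.read(&mut src[..max])?;
        let mut written = 0;
        for &byte in &src[..read] {
            if byte < 0x80 {
                out[written] = byte;
                written += 1;
            } else {
                // U+0080..=U+00FF take two bytes in UTF-8.
                out[written] = 0xC0 | (byte >> 6);
                written += 1;
                let second = 0x80 | (byte & 0x3F);
                if written < out.len() {
                    out[written] = second;
                    written += 1;
                } else {
                    self.pending = Some(second);
                }
            }
        }
        Ok(written)
    }
}
```

The pending field is the only state the adapter needs, because a Latin1 byte can straddle the end of the output buffer by at most one UTF-8 byte. Stateful legacy encodings need considerably more bookkeeping than this sketch.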
Please see the file named COPYRIGHT.
Generated API documentation is available online.
An FFI layer for encoding_rs is available as a separate crate. The crate comes with a demo C++ wrapper using the C++ standard library and GSL types.
For the Gecko context, there's a C++ wrapper using the MFBT/XPCOM types.
These bindings do not cover the mem module.
There are currently five optional cargo features:
simd-accel
Enables SSE2 acceleration on x86 and x86_64 and NEON acceleration on Aarch64 and ARMv7. Enabling this cargo feature is recommended when building for x86, x86_64, ARMv7 or Aarch64. The intention is for the functionality enabled by this feature to become the normal on-by-default behavior once portable SIMD becomes part of stable Rust.
Enabling this feature breaks the build unless the target is x86 with SSE2 (Rust's default 32-bit x86 target, i686, has SSE2, but Linux distros may use an x86 target without SSE2, i.e. i586 in rustup terms), ARMv7 or thumbv7 with NEON (-C target_feature=+neon), x86_64 or Aarch64.
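The target condition above can be written down as a compile-time check. This is only a sketch of the stated rule using standard cfg predicates; the function name is hypothetical, and it is not how encoding_rs itself gates the feature.

```rust
/// Hypothetical helper: reports whether the current compilation target
/// satisfies the condition described for the simd-accel feature.
pub fn simd_accel_target_ok() -> bool {
    cfg!(any(
        // x86_64 and Aarch64 qualify unconditionally.
        target_arch = "x86_64",
        target_arch = "aarch64",
        // 32-bit x86 qualifies only when SSE2 is enabled (i686, not i586).
        all(target_arch = "x86", target_feature = "sse2"),
        // ARMv7 and thumbv7 qualify only when NEON is enabled.
        all(target_arch = "arm", target_feature = "neon")
    ))
}
```

Because cfg! is evaluated at compile time, the function reflects the flags the crate was built with, for example -C target_feature=+neon, and not the CPU the binary later runs on.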
serde
Enables support for serializing and deserializing &'static Encoding-typed struct fields using Serde.
less-slow-kanji-encode
Makes JIS X 0208 Level 1 Kanji (the most common Kanji in Shift_JIS, EUC-JP and ISO-2022-JP) encode less slow (binary search instead of linear search) at the expense of binary size. (Does not affect decode speed.)
Not used by Firefox.
less-slow-gb-hanzi-encode
Makes GB2312 Level 1 Hanzi (the most common Hanzi in gb18030 and GBK) encode less slow (binary search instead of linear search) at the expense of binary size. (Does not affect decode speed.)
Not used by Firefox.
less-slow-big5-hanzi-encode
Makes Big5 Level 1 Hanzi (the most common Hanzi in Big5) encode less slow (binary search instead of linear search) at the expense of binary size. (Does not affect decode speed.)
Not used by Firefox.
For decoding to UTF-16, the goal is to perform at least as well as Gecko's old uconv. For decoding to UTF-8, the goal is to perform at least as well as rust-encoding.
Encoding to UTF-8 should be fast. (UTF-8 to UTF-8 encode should be equivalent to memcpy and UTF-16 to UTF-8 should be fast.)
Speed is a non-goal when encoding to legacy encodings. Encoding to legacy encodings should not be optimized for speed at the expense of code size as long as form submission and URL parsing in Gecko don't become noticeably too slow in real-world use.
In the interest of binary size, by default, encoding_rs does not have any encode-specific data tables. Therefore, encoders search the decode-optimized data tables. This is a linear search in most cases. As a result, encode to legacy encodings varies from slow to extremely slow relative to other libraries. Still, with realistic workloads, this seemed fast enough not to be user-visibly slow on Raspberry Pi 3 (which stood in for a phone for testing) in the Web-exposed encoder use cases.
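The trade-off described above can be sketched with a toy table. Everything in the following example is an assumption made for illustration: the table contents, the names and the layout are hypothetical and do not reflect encoding_rs's actual data structures. A decode-optimized table maps a position (pointer) to a code point, so encoding without extra data means scanning it linearly, while a second table sorted by code point allows binary search at the cost of storing the data twice.

```rust
/// Hypothetical decode-optimized table: the index is the pointer into the
/// legacy encoding's index and the value is the code point.
static DECODE_TABLE: [u16; 6] = [0x4E9C, 0x5516, 0x5A03, 0x963F, 0x54C0, 0x611B];

/// Default approach: no encode-specific data, so scan the decode table.
/// The cost grows linearly with the size of the table.
pub fn encode_linear(code_point: u16) -> Option<usize> {
    DECODE_TABLE.iter().position(|&candidate| candidate == code_point)
}

/// Hypothetical encode-specific table: (code point, pointer) pairs sorted by
/// code point. It duplicates the data, which is the binary size cost.
static ENCODE_TABLE: [(u16, u16); 6] = [
    (0x4E9C, 0),
    (0x54C0, 4),
    (0x5516, 1),
    (0x5A03, 2),
    (0x611B, 5),
    (0x963F, 3),
];

/// Size-for-speed approach: binary search over the sorted table.
pub fn encode_binary(code_point: u16) -> Option<usize> {
    ENCODE_TABLE
        .binary_search_by_key(&code_point, |&(candidate, _)| candidate)
        .ok()
        .map(|position| ENCODE_TABLE[position].1 as usize)
}
```

Both functions return the same pointer for every code point in the table. Only the lookup cost and the amount of static data differ.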
See the cargo features above for optionally making Kanji and Hanzi legacy encode a bit less slow.
Actually fast options for legacy encode may be added in the future, but there do not appear to be pressing use cases.
A framework for measuring performance is available separately.
It is a goal to support the latest stable Rust, the latest nightly Rust and the version of Rust that's used for Firefox Nightly (currently 1.25.0). These are tested on Travis.
Additionally, beta and the oldest known to work Rust version (currently 1.21.0) are tested on Travis. The oldest Rust known to work is tested as a canary so that when the oldest known to work no longer works, the change can be documented here. At this time, there is no firm commitment to support a version older than what's required by Firefox. The oldest supported Rust is expected to move forward rapidly when stdsimd can replace the simd crate without performance regression.
A compatibility layer that implements the rust-encoding API on top of encoding_rs is provided as a separate crate (cannot be uploaded to crates.io). The compatibility layer was originally written with the assumption that Firefox would need it, but it is not currently used in Firefox.
[...] usize instead of u8 at a time).
[...] --features simd-accel work with stable-channel compiler to simplify the Firefox build system.
[...] is_foo_bidi() not treat U+FEFF (ZERO WIDTH NO-BREAK SPACE aka. BYTE ORDER MARK) as right-to-left.
[...] is_foo_bidi() functions report true if the input contains Hebrew presentation forms (which are right-to-left but not in a right-to-left-roadmapped block).
[...] convert_utf16_to_latin1_lossy.
[...] mem module assert that the input is in the range U+0000...U+00FF (inclusive).
[...] mem module provide conversions from Latin1 and UTF-16 to UTF-8 that can deal with insufficient output space. The idea is to use them first with an allocation rounded up to jemalloc bucket size and do the worst-case allocation only if the jemalloc rounding up was insufficient as the first guess.
[...] simd-accel-specific memory corruption introduced in version 0.8.1 in conversions between UTF-16 and Latin1 in the mem module.
[...] #[inline(never)] annotation that was not meant for release.
[...] mem module to increase the performance when converting long buffers.
[...] mem module.
[...] mem module.
[...] replacement a label of the replacement encoding. (Spec change.)
[...] Encoding::for_name(). (Encoding::for_label(foo).unwrap() is now close enough after the above label change.)
[...] parallel-utf8 cargo feature.
[...] &'static Encoding.
[...] Encoder::has_pending_state() public.
[...] simd crate dependency to 0.2.0.
[...] 7F correctly in ISO-2022-JP.
[...] Hash for Encoding.
[...] InputEmpty correct precedence over OutputFull when encoding with replacement and the output buffer passed in is too short or the remaining space in the output buffer is too small after a replacement.
[...] PartialEq and Eq for the CoderResult, DecoderResult and EncoderResult types.
[...] Encoder::encode_from_utf16. (Due to an oversight, it lacked the fix that Encoder::encode_from_utf8 already had.)
[...] #[must_use].
[...] parallel-utf8).
[...] simd-accel is used.
[...] Encoding from const to static to make the referents unique across crates that use the references.
[...] FOO_INIT instances of Encoding to allow foreign crates to initialize static arrays with references to Encoding instances even under Rust's constraints that prohibit the initialization of &'static Encoding-typed array items with &'static Encoding-typed statics.
[...] const to work so that cross-crate usage keeps the referents unique.
[...] Cows from Rust-only non-streaming methods for encode and decode.
[...] Encoding::for_bom() returns the length of the BOM.
[...] simd-accel feature flag. (Requires nightly Rust.)
[...] Encoder.encode_from_utf8_to_vec_without_replacement().
Add Encoding.is_ascii_compatible().
Add Encoding::for_bom().
Make == for Encoding use name comparison instead of pointer comparison, because uses of the encoding constants in different crates result in different addresses and the constants cannot be turned into statics without breaking other things.
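The distinction between the two kinds of comparison can be sketched as follows. The type Enc and both helper functions are hypothetical stand-ins, not the crate's actual Encoding type. Comparing by name treats two separately materialized copies of the same descriptor as equal, whereas comparing by address does not.

```rust
/// Hypothetical stand-in for an encoding descriptor.
pub struct Enc {
    pub name: &'static str,
}

/// Equality by name: two copies of the same descriptor compare equal
/// even if they live at different addresses.
impl PartialEq for Enc {
    fn eq(&self, other: &Enc) -> bool {
        self.name == other.name
    }
}

/// Compares two descriptors by name via the `PartialEq` implementation.
pub fn same_by_name(a: &Enc, b: &Enc) -> bool {
    a == b
}

/// Compares two descriptors by address, which is only meaningful when
/// every user refers to one unique instance.
pub fn same_by_address(a: &Enc, b: &Enc) -> bool {
    std::ptr::eq(a, b)
}
```

With two distinct instances that both carry the name UTF-8, same_by_name reports equality while same_by_address does not, which mirrors the cross-crate situation described above.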
The initial release.