encoding_rs is an implementation of (the non-JavaScript parts of) the Encoding Standard written in Rust and used in Gecko (starting with Firefox 56).
Additionally, the mem module provides various operations for dealing with in-RAM text (as opposed to data that's coming from or going to an IO boundary).
The mem module is a module instead of a separate crate due to internal implementation detail efficiencies.
Due to the Gecko use case, encoding_rs supports decoding to and encoding from UTF-16 in addition to supporting the usual Rust use case of decoding to and encoding from UTF-8. Additionally, the API has been designed to be FFI-friendly to accommodate the C++ side of Gecko.
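To illustrate what "potentially invalid" UTF-16 means in practice, here is a minimal sketch that uses only the Rust standard library, not the encoding_rs API. The function name utf16_to_utf8_lossy is made up for this example. It converts UTF-16 code units to UTF-8 and substitutes the REPLACEMENT CHARACTER for each lone surrogate.

```rust
use std::char::{decode_utf16, REPLACEMENT_CHARACTER};

/// Converts potentially-invalid UTF-16 to UTF-8 using only the standard
/// library. Each lone (unpaired) surrogate becomes U+FFFD.
/// Hypothetical helper for illustration; not part of encoding_rs.
pub fn utf16_to_utf8_lossy(units: &[u16]) -> String {
    decode_utf16(units.iter().copied())
        // `decode_utf16` yields `Err` for every unpaired surrogate.
        .map(|unit| unit.unwrap_or(REPLACEMENT_CHARACTER))
        .collect()
}
```

Valid surrogate pairs are combined into one character, so only unpaired surrogates are replaced.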
Specifically, encoding_rs does the following:
* u16 / char16_t).
* u16 / char16_t) into a sequence of bytes in an Encoding Standard-defined character encoding as if the lone surrogates had been replaced with the REPLACEMENT CHARACTER before performing the encode. (Gecko's UTF-16 is potentially invalid.)
* document.characterSet.

Additionally, encoding_rs::mem does the following:
std::io
Notably, the above feature list doesn't include the capability to wrap a std::io::Read, decode it into UTF-8 and present the result via std::io::Read. The encoding_rs_io crate provides that capability.
For decoding character encodings that occur in email, use the charset crate instead of using this one directly. (It wraps this crate and adds UTF-7 decoding.)
For mappings to and from Windows code page identifiers, use the codepage crate.
Please see the file named COPYRIGHT.
Generated API documentation is available online.
There is a long-form write-up about the design and internals of the crate.
An FFI layer for encoding_rs is available as a separate crate. The crate comes with a demo C++ wrapper using the C++ standard library and GSL types.
For the Gecko context, there's a C++ wrapper using the MFBT/XPCOM types.
These bindings do not cover the mem module.
There's a write-up about the C++ wrappers.
There are currently these optional cargo features:
simd-accel
Enables SIMD acceleration using the nightly-dependent packed_simd crate.
This is an opt-in feature, because enabling this feature opts out of Rust's guarantees of future compilers compiling old code (aka. "stability story").
Currently, this has not been tested to be an improvement except for these targets:
If you use nightly Rust, you use targets whose first component is one of the above, and you are prepared to have to revise your configuration when updating Rust, you should enable this feature. Otherwise, please do not enable this feature.
Note! If you are compiling for a target that does not have 128-bit SIMD enabled as part of the target definition and you are enabling 128-bit SIMD using -C target_feature, you need to enable the core_arch Cargo feature for packed_simd to compile a crates.io snapshot of core_arch instead of using the standard-library copy of core::arch, because the core::arch module of the pre-compiled standard library has been compiled with the assumption that the CPU doesn't have 128-bit SIMD. At present this applies mainly to 32-bit ARM targets whose first component does not include the substring neon.
The encoding_rs side of things has not been properly set up for POWER, PowerPC, MIPS, etc., SIMD at this time, so even if you were to follow the advice from the previous paragraph, you probably shouldn't use the simd-accel option on the less mainstream architectures at this time.
Used by Firefox.
serde
Enables support for serializing and deserializing &'static Encoding-typed struct fields using Serde.
Not used by Firefox.
fast-legacy-encode
A catch-all option for enabling the fastest legacy encode options. Does not affect decode speed or UTF-8 encode speed.
At present, this option is equivalent to enabling the following options:
* fast-hangul-encode
* fast-hanja-encode
* fast-kanji-encode
* fast-gb-hanzi-encode
* fast-big5-hanzi-encode
Adds 176 KB to the binary size.
Not used by Firefox.
fast-hangul-encode
Changes encoding of precomposed Hangul syllables into EUC-KR from binary search over the decode-optimized tables to lookup by index, making Korean plain-text encode about 4 times as fast as without this option.
Adds 20 KB to the binary size.
Does not affect decode speed.
Not used by Firefox.
fast-hanja-encode
Changes encoding of Hanja into EUC-KR from linear search over the decode-optimized table to lookup by index. Since Hanja is practically absent in modern Korean text, this option doesn't affect performance in the common case and mainly makes sense if you want to make your application resilient against denial of service by someone intentionally feeding it a lot of Hanja to encode into EUC-KR.
Adds 40 KB to the binary size.
Does not affect decode speed.
Not used by Firefox.
fast-kanji-encode
Changes encoding of Kanji into Shift_JIS, EUC-JP and ISO-2022-JP from linear search over the decode-optimized tables to lookup by index making Japanese plain-text encode to legacy encodings 30 to 50 times as fast as without this option (about 2 times as fast as with less-slow-kanji-encode).
Takes precedence over less-slow-kanji-encode.
Adds 36 KB to the binary size (24 KB compared to less-slow-kanji-encode).
Does not affect decode speed.
Not used by Firefox.
less-slow-kanji-encode
Makes JIS X 0208 Level 1 Kanji (the most common Kanji in Shift_JIS, EUC-JP and ISO-2022-JP) encode less slow (binary search instead of linear search) making Japanese plain-text encode to legacy encodings 14 to 23 times as fast as without this option.
Adds 12 KB to the binary size.
Does not affect decode speed.
Not used by Firefox.
fast-gb-hanzi-encode
Changes encoding of Hanzi in the CJK Unified Ideographs block into GBK and gb18030 from linear search over a part of the decode-optimized tables followed by a binary search over another part of the decode-optimized tables to lookup by index making Simplified Chinese plain-text encode to the legacy encodings 100 to 110 times as fast as without this option (about 2.5 times as fast as with less-slow-gb-hanzi-encode).
Takes precedence over less-slow-gb-hanzi-encode.
Adds 36 KB to the binary size (24 KB compared to less-slow-gb-hanzi-encode).
Does not affect decode speed.
Not used by Firefox.
less-slow-gb-hanzi-encode
Makes GB2312 Level 1 Hanzi (the most common Hanzi in gb18030 and GBK) encode less slow (binary search instead of linear search) making Simplified Chinese plain-text encode to the legacy encodings about 40 times as fast as without this option.
Adds 12 KB to the binary size.
Does not affect decode speed.
Not used by Firefox.
fast-big5-hanzi-encode
Changes encoding of Hanzi in the CJK Unified Ideographs block into Big5 from linear search over a part of the decode-optimized tables to lookup by index making Traditional Chinese plain-text encode to Big5 105 to 125 times as fast as without this option (about 3 times as fast as with less-slow-big5-hanzi-encode).
Takes precedence over less-slow-big5-hanzi-encode.
Adds 40 KB to the binary size (20 KB compared to less-slow-big5-hanzi-encode).
Does not affect decode speed.
Not used by Firefox.
less-slow-big5-hanzi-encode
Makes Big5 Level 1 Hanzi (the most common Hanzi in Big5) encode less slow (binary search instead of linear search) making Traditional Chinese plain-text encode to Big5 about 36 times as fast as without this option.
Adds 20 KB to the binary size.
Does not affect decode speed.
Not used by Firefox.
For decoding to UTF-16, the goal is to perform at least as well as Gecko's old uconv. For decoding to UTF-8, the goal is to perform at least as well as rust-encoding. These goals have been achieved.
Encoding to UTF-8 should be fast. (UTF-8 to UTF-8 encode should be equivalent to memcpy and UTF-16 to UTF-8 should be fast.)
Speed is a non-goal when encoding to legacy encodings. By default, encoding to legacy encodings should not be optimized for speed at the expense of code size as long as form submission and URL parsing in Gecko don't become noticeably too slow in real-world use.
In the interest of binary size, by default, encoding_rs does not have encode-specific data tables beyond 32 bits of encode-specific data for each single-byte encoding. Therefore, encoders search the decode-optimized data tables. This is a linear search in most cases. As a result, by default, encode to legacy encodings varies from slow to extremely slow relative to other libraries. Still, with realistic workloads, this seemed fast enough not to be user-visibly slow on Raspberry Pi 3 (which stood in for a phone for testing) in the Web-exposed encoder use cases.
See the cargo features above for optionally making CJK legacy encode fast.
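The trade-off between the default, less-slow and fast options can be sketched on a toy table. The following is an illustration only: the table contents, the BASE constant and all function names are made up for this example, and the real data layout in encoding_rs differs. It contrasts linear search over a decode-optimized table, binary search over an extra sorted table, and lookup by index in an encode-specific table.

```rust
// Toy decode-optimized table: the index is the "pointer" and the value is the
// code point. Illustrative contents, not taken from encoding_rs.
const DECODE_TABLE: [u16; 8] = [
    0x4E9C, 0x5516, 0x5A03, 0x963F, 0x54C0, 0x611B, 0x6328, 0x59F6,
];

// Default strategy: scan the decode table. No extra data, O(n) per character.
pub fn encode_linear(code_point: u16) -> Option<usize> {
    DECODE_TABLE.iter().position(|&c| c == code_point)
}

// "less-slow" strategy: an extra table sorted by code point, O(log n) lookup.
pub fn build_sorted() -> Vec<(u16, usize)> {
    let mut pairs: Vec<(u16, usize)> = DECODE_TABLE
        .iter()
        .enumerate()
        .map(|(pointer, &c)| (c, pointer))
        .collect();
    pairs.sort_unstable();
    pairs
}

pub fn encode_binary(sorted: &[(u16, usize)], code_point: u16) -> Option<usize> {
    sorted
        .binary_search_by_key(&code_point, |&(c, _)| c)
        .ok()
        .map(|i| sorted[i].1)
}

// "fast" strategy: a table indexed directly by code point offset. O(1) per
// character, at the cost of the largest amount of extra data.
const BASE: u16 = 0x4E00; // hypothetical start of the covered range

pub fn build_index() -> Vec<Option<usize>> {
    let max = *DECODE_TABLE.iter().max().unwrap();
    let mut index = vec![None; (max - BASE) as usize + 1];
    for (pointer, &c) in DECODE_TABLE.iter().enumerate() {
        index[(c - BASE) as usize] = Some(pointer);
    }
    index
}

pub fn encode_indexed(index: &[Option<usize>], code_point: u16) -> Option<usize> {
    let offset = code_point.checked_sub(BASE)? as usize;
    index.get(offset).copied().flatten()
}
```

All three functions return the same pointer for a given code point. What changes is how much encode-specific data has to ship in the binary, which is the size cost the feature descriptions list.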
A framework for measuring performance is available separately.
It is a goal to support the latest stable Rust, the latest nightly Rust and the version of Rust that's used for Firefox Nightly (currently 1.29.0). These are tested on Travis.
Additionally, beta and the oldest known to work Rust version (currently 1.29.0) are tested on Travis. The oldest Rust known to work is tested as a canary so that when the oldest known to work no longer works, the change can be documented here. At this time, there is no firm commitment to support a version older than what's required by Firefox. The oldest supported Rust is expected to move forward rapidly when packed_simd can replace the simd crate without performance regression.
A compatibility layer that implements the rust-encoding API on top of encoding_rs is provided as a separate crate (cannot be uploaded to crates.io). The compatibility layer was originally written with the assumption that Firefox would need it, but it is not currently used in Firefox.
To regenerate the generated code:
* Have a checkout of https://github.com/hsivonen/encoding_c next to the encoding_rs directory.
* Have a checkout of https://github.com/hsivonen/codepage next to the encoding_rs directory.
* Have a checkout of https://github.com/whatwg/encoding next to the encoding_rs directory.
* Make sure the checkout is at revision f381389 of the encoding repo.
* With the encoding_rs directory as the working directory, run python generate-encoding-data.py.

* usize
instead of u8 at a time).
* bincode (dev dependency) version requirement to 1.0.
* simd crate to packed_simd.
* simd-accel (README-only release).
* clippy:: prefix from clippy lint names.
* static when defining another static).
* is_single_byte() on Encoding.
* mem::decode_latin1() and mem::encode_latin1_lossy().
* --features simd-accel work with stable-channel compiler to simplify the Firefox build system.
* is_foo_bidi()
not treat U+FEFF (ZERO WIDTH NO-BREAK SPACE aka. BYTE ORDER MARK) as right-to-left.
* is_foo_bidi() functions report true if the input contains Hebrew presentation forms (which are right-to-left but not in a right-to-left-roadmapped block).
* convert_utf16_to_latin1_lossy.
* mem module assert that the input is in the range U+0000...U+00FF (inclusive).
* mem module provide conversions from Latin1 and UTF-16 to UTF-8 that can deal with insufficient output space. The idea is to use them first with an allocation rounded up to jemalloc bucket size and do the worst-case allocation only if the jemalloc rounding up was insufficient as the first guess.
* simd-accel-specific memory corruption introduced in version 0.8.1 in conversions between UTF-16 and Latin1 in the mem module.
* #[inline(never)]
annotation that was not meant for release.
* mem module to increase the performance when converting long buffers.
* mem module.
* mem module.
* replacement a label of the replacement encoding. (Spec change.)
* Encoding::for_name(). (Encoding::for_label(foo).unwrap() is now close enough after the above label change.)
* parallel-utf8 cargo feature.
* &'static Encoding.
* Encoder::has_pending_state() public.
* simd crate dependency to 0.2.0.
* 7F correctly in ISO-2022-JP.
* Hash for Encoding.
* InputEmpty correct precedence over OutputFull when encoding with replacement and the output buffer passed in is too short or the remaining space in the output buffer is too small after a replacement.
* PartialEq
and Eq for the CoderResult, DecoderResult and EncoderResult types.
* Encoder::encode_from_utf16. (Due to an oversight, it lacked the fix that Encoder::encode_from_utf8 already had.)
* #[must_use].
* parallel-utf8).
* simd-accel is used.
* Encoding from const to static to make the referents unique across crates that use the references.
* FOO_INIT
instances of Encoding to allow foreign crates to initialize static arrays with references to Encoding instances even under Rust's constraints that prohibit the initialization of &'static Encoding-typed array items with &'static Encoding-typed statics.
* const to work so that cross-crate usage keeps the referents unique.
* Cows from Rust-only non-streaming methods for encode and decode.
* Encoding::for_bom() returns the length of the BOM.
* simd-accel feature flag. (Requires nightly Rust.)
* Encoder.encode_from_utf8_to_vec_without_replacement().
* Add Encoding.is_ascii_compatible().
* Add Encoding::for_bom().
* Make == for Encoding use name comparison instead of pointer comparison, because uses of the encoding constants in different crates result in different addresses and the constants cannot be turned into statics without breaking other things.
The initial release.