encoding_rs

Build Status crates.io docs.rs Apache 2 / MIT dual-licensed

encoding_rs is an implementation of the (non-JavaScript parts of) the Encoding Standard, written in Rust and used in Gecko (starting with Firefox 56).

Additionally, the mem module provides various operations for dealing with in-RAM text (as opposed to data that's coming from or going to an IO boundary). The mem module is a module instead of a separate crate due to internal implementation detail efficiencies.

Functionality

Due to the Gecko use case, encoding_rs supports decoding to and encoding from UTF-16 in addition to supporting the usual Rust use case of decoding to and encoding from UTF-8. Additionally, the API has been designed to be FFI-friendly to accommodate the C++ side of Gecko.

Specifically, encoding_rs does the following:

Additionally, encoding_rs::mem does the following:

Integration with std::io

Notably, the above feature list doesn't include the capability to wrap a std::io::Read, decode it into UTF-8 and present the result via std::io::Read. The encoding_rs_io crate provides that capability.

Decoding Email

For decoding character encodings that occur in email, use the charset crate instead of using this one directly. (It wraps this crate and adds UTF-7 decoding.)

Licensing

Please see the file named COPYRIGHT.

API Documentation

Generated API documentation is available online.

C and C++ bindings

An FFI layer for encoding_rs is available as a separate crate. The crate comes with a demo C++ wrapper using the C++ standard library and GSL types.

For the Gecko context, there's a C++ wrapper using the MFBT/XPCOM types.

These bindings do not cover the mem module.

Sample programs

Optional features

There are currently these optional cargo features:

simd-accel

Enables SSE2 acceleration on x86 and x86_64 and NEON acceleration on Aarch64 and ARMv7. Enabling this cargo feature is recommended when building for x86, x86_64, ARMv7 or Aarch64. The intention is for the functionality enabled by this feature to become the normal on-by-default behavior once portable SIMD becomes part of stable Rust.

Enabling this feature breaks the build unless the target is x86 with SSE2 (Rust's default 32-bit x86 target, i686, has SSE2, but Linux distros may use an x86 target without SSE2, i.e. i586 in rustup terms), ARMv7 or thumbv7 with NEON (-C target_feature=+neon), x86_64 or Aarch64.
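The target conditions above can be approximated in code with `cfg!`. The following is a minimal sketch, not something the crate provides: `simd_accel_target_ok` is a hypothetical helper, and it treats any `arm` target with NEON as a stand-in for "ARMv7 or thumbv7 with NEON" without checking the ARM architecture version.

```rust
/// Hypothetical helper (not part of encoding_rs): reports whether the
/// target this program was compiled for matches the conditions listed
/// for `simd-accel`. Approximation: `arm` + NEON is accepted without
/// checking that the target is actually ARMv7 or thumbv7.
fn simd_accel_target_ok() -> bool {
    cfg!(any(
        all(target_arch = "x86", target_feature = "sse2"),
        target_arch = "x86_64",
        target_arch = "aarch64",
        all(target_arch = "arm", target_feature = "neon")
    ))
}

fn main() {
    println!(
        "simd-accel expected to build for this target: {}",
        simd_accel_target_ok()
    );
}
```

A build script or CI job could use a check along these lines to decide whether to pass the feature, instead of enabling it unconditionally.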

Used by Firefox.

serde

Enables support for serializing and deserializing &'static Encoding-typed struct fields using Serde.

Not used by Firefox.

fast-legacy-encode

A catch-all option for enabling the fastest legacy encode options. Does not affect decode speed or UTF-8 encode speed.

At present, this option is equivalent to enabling the following options:

* fast-hangul-encode
* fast-hanja-encode
* fast-kanji-encode
* fast-gb-hanzi-encode
* fast-big5-hanzi-encode

Adds 176 KB to the binary size.

Not used by Firefox.

fast-hangul-encode

Changes encoding of precomposed Hangul syllables into EUC-KR from binary search over the decode-optimized tables to lookup by index, making Korean plain-text encode about 4 times as fast as without this option.
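To make the trade-off concrete, here is a toy sketch contrasting binary search over a decode-oriented table with direct indexing into an encode-oriented table. It is an illustration only: the names, the table layout and the tiny data set are all hypothetical and do not reflect the crate's real EUC-KR tables.

```rust
// Toy illustration only: names and data are hypothetical.

/// First code point of the contiguous block covered by the toy tables.
const BLOCK_START: u32 = 0xAC00;

/// Decode-oriented table: position = pointer, value = offset of the code
/// point from BLOCK_START. It is sorted, so it can be binary searched.
const DECODE_OFFSETS: [u16; 5] = [0, 1, 4, 8, 9];

/// Sentinel meaning "no mapping" in the encode-oriented table.
const UNMAPPED: u8 = u8::MAX;

/// Encode-oriented table: position = offset from BLOCK_START,
/// value = pointer. This is the extra data that costs binary size.
const ENCODE_POINTERS: [u8; 10] = [
    0, 1, UNMAPPED, UNMAPPED, 2, UNMAPPED, UNMAPPED, UNMAPPED, 3, 4,
];

/// Encode by binary search over the decode-oriented table.
fn pointer_by_search(c: char) -> Option<usize> {
    let offset = (c as u32).checked_sub(BLOCK_START)?;
    DECODE_OFFSETS
        .binary_search_by_key(&offset, |&o| u32::from(o))
        .ok()
}

/// Encode by indexing the encode-oriented table directly.
fn pointer_by_index(c: char) -> Option<usize> {
    let offset = (c as u32).checked_sub(BLOCK_START)? as usize;
    match ENCODE_POINTERS.get(offset) {
        Some(&p) if p != UNMAPPED => Some(usize::from(p)),
        _ => None,
    }
}

fn main() {
    for c in ['\u{AC00}', '\u{AC04}', '\u{AC02}', 'A'] {
        // Both strategies must agree; only their cost differs.
        assert_eq!(pointer_by_search(c), pointer_by_index(c));
        println!("{:?} -> {:?}", c, pointer_by_index(c));
    }
}
```

Both functions return the same pointer. The indexed version does a constant amount of work per character but needs a second table, which is the kind of size cost this option reports.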

Adds 20 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

fast-hanja-encode

Changes encoding of Hanja into EUC-KR from linear search over the decode-optimized table to lookup by index. Since Hanja is practically absent in modern Korean text, this option doesn't affect performance in the common case and mainly makes sense if you want to make your application resilient against denial of service by someone intentionally feeding it a lot of Hanja to encode into EUC-KR.

Adds 40 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

fast-kanji-encode

Changes encoding of Kanji into Shift_JIS, EUC-JP and ISO-2022-JP from linear search over the decode-optimized tables to lookup by index making Japanese plain-text encode to legacy encodings 30 to 50 times as fast as without this option (about 2 times as fast as with less-slow-kanji-encode).

Takes precedence over less-slow-kanji-encode.

Adds 36 KB to the binary size (24 KB compared to less-slow-kanji-encode).

Does not affect decode speed.

Not used by Firefox.

less-slow-kanji-encode

Makes JIS X 0208 Level 1 Kanji (the most common Kanji in Shift_JIS, EUC-JP and ISO-2022-JP) encode less slow (binary search instead of linear search) making Japanese plain-text encode to legacy encodings 14 to 23 times as fast as without this option.
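The idea of speeding up only the common subset can be sketched as follows. This is a toy model under stated assumptions, not the crate's implementation: the table contents are arbitrary, the names are hypothetical, and the real tables are much larger.

```rust
// Toy illustration only: names and data are hypothetical.

/// Decode-oriented table: position = pointer, value = code point.
/// It is not sorted by code point, so encoding from it alone needs a
/// linear scan.
const DECODE_TABLE: [u16; 7] = [
    0x4E9C, 0x5516, 0x5A03, 0x963F, 0x54C0, 0x611B, 0x6328,
];

/// Extra encode-oriented table for the "common" entries (pointers 0..=4):
/// (code point, pointer) pairs sorted by code point.
const COMMON_SORTED: [(u16, u8); 5] = [
    (0x4E9C, 0),
    (0x54C0, 4),
    (0x5516, 1),
    (0x5A03, 2),
    (0x963F, 3),
];

/// Baseline: linear search over the decode-oriented table.
fn encode_linear(c: char) -> Option<usize> {
    let cp = c as u32;
    DECODE_TABLE.iter().position(|&x| u32::from(x) == cp)
}

/// "Less slow": binary search over the sorted common subset first, then
/// fall back to the linear search for everything else.
fn encode_less_slow(c: char) -> Option<usize> {
    let cp = c as u32;
    match COMMON_SORTED.binary_search_by_key(&cp, |&(code_point, _)| u32::from(code_point)) {
        Ok(i) => Some(usize::from(COMMON_SORTED[i].1)),
        Err(_) => encode_linear(c),
    }
}

fn main() {
    for c in ['\u{54C0}', '\u{611B}', 'A'] {
        // Both strategies must produce the same pointer.
        assert_eq!(encode_less_slow(c), encode_linear(c));
        println!("{:?} -> {:?}", c, encode_less_slow(c));
    }
}
```

Only the common entries pay for the extra sorted table, so the size cost is smaller than for a full lookup-by-index table, while rarer characters still take the slow path.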

Adds 12 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

fast-gb-hanzi-encode

Changes encoding of Hanzi in the CJK Unified Ideographs block into GBK and gb18030 from linear search over a part of the decode-optimized tables followed by a binary search over another part of the decode-optimized tables to lookup by index, making Simplified Chinese plain-text encode to the legacy encodings 100 to 110 times as fast as without this option (about 2.5 times as fast as with less-slow-gb-hanzi-encode).

Takes precedence over less-slow-gb-hanzi-encode.

Adds 36 KB to the binary size (24 KB compared to less-slow-gb-hanzi-encode).

Does not affect decode speed.

Not used by Firefox.

less-slow-gb-hanzi-encode

Makes GB2312 Level 1 Hanzi (the most common Hanzi in gb18030 and GBK) encode less slow (binary search instead of linear search) making Simplified Chinese plain-text encode to the legacy encodings about 40 times as fast as without this option.

Adds 12 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

fast-big5-hanzi-encode

Changes encoding of Hanzi in the CJK Unified Ideographs block into Big5 from linear search over a part of the decode-optimized tables to lookup by index, making Traditional Chinese plain-text encode to Big5 105 to 125 times as fast as without this option (about 3 times as fast as with less-slow-big5-hanzi-encode).

Takes precedence over less-slow-big5-hanzi-encode.

Adds 40 KB to the binary size (20 KB compared to less-slow-big5-hanzi-encode).

Does not affect decode speed.

Not used by Firefox.

less-slow-big5-hanzi-encode

Makes Big5 Level 1 Hanzi (the most common Hanzi in Big5) encode less slow (binary search instead of linear search) making Traditional Chinese plain-text encode to Big5 about 36 times as fast as without this option.

Adds 20 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

Performance goals

For decoding to UTF-16, the goal is to perform at least as well as Gecko's old uconv. For decoding to UTF-8, the goal is to perform at least as well as rust-encoding. These goals have been achieved.

Encoding to UTF-8 should be fast. (UTF-8 to UTF-8 encode should be equivalent to memcpy and UTF-16 to UTF-8 should be fast.)

Speed is a non-goal when encoding to legacy encodings. By default, encoding to legacy encodings should not be optimized for speed at the expense of code size as long as form submission and URL parsing in Gecko don't become noticeably too slow in real-world use.

In the interest of binary size, by default, encoding_rs does not have encode-specific data tables beyond 32 bits of encode-specific data for each single-byte encoding. Therefore, encoders search the decode-optimized data tables. This is a linear search in most cases. As a result, by default, encode to legacy encodings varies from slow to extremely slow relative to other libraries. Still, with realistic workloads, this seemed fast enough not to be user-visibly slow on Raspberry Pi 3 (which stood in for a phone for testing) in the Web-exposed encoder use cases.
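The asymmetry between decode and encode described above can be illustrated with a toy single-byte encoding. This is a minimal sketch with hypothetical names and a made-up four-entry table; it is not one of the crate's real tables and does not model the 32 bits of encode-specific data.

```rust
// Toy illustration only: names and data are hypothetical.

/// Decode-optimized table for bytes 0x80..=0x83:
/// position = byte - 0x80, value = code point.
const TOY_DECODE_TABLE: [u16; 4] = [0x20AC, 0x0192, 0x2026, 0x2020];

/// Decoding is a direct index into the table.
fn toy_decode(byte: u8) -> Option<char> {
    if byte < 0x80 {
        return Some(byte as char);
    }
    let index = usize::from(byte - 0x80);
    TOY_DECODE_TABLE
        .get(index)
        .and_then(|&cp| char::from_u32(u32::from(cp)))
}

/// Encoding has no table of its own, so it scans the decode table
/// linearly until it finds the code point.
fn toy_encode_linear(c: char) -> Option<u8> {
    if c.is_ascii() {
        return Some(c as u8);
    }
    let cp = c as u32;
    TOY_DECODE_TABLE
        .iter()
        .position(|&x| u32::from(x) == cp)
        .map(|i| 0x80 + i as u8)
}

fn main() {
    for byte in [0x41u8, 0x80, 0x83] {
        let c = toy_decode(byte).expect("mapped in the toy table");
        // Round trip: the linear search recovers the original byte.
        assert_eq!(toy_encode_linear(c), Some(byte));
        println!("0x{:02X} <-> {:?}", byte, c);
    }
}
```

With four entries the scan is trivial. With tables holding thousands of entries, as in CJK encodings, the same scan is what makes default legacy encode slow.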

See the cargo features above for optionally making CJK legacy encode fast.

A framework for measuring performance is available separately.

Rust Version Compatibility

It is a goal to support the latest stable Rust, the latest nightly Rust and the version of Rust that's used for Firefox Nightly (currently 1.29.0). These are tested on Travis.

Additionally, beta and the oldest known to work Rust version (currently 1.29.0) are tested on Travis. The oldest Rust known to work is tested as a canary so that when the oldest known to work no longer works, the change can be documented here. At this time, there is no firm commitment to support a version older than what's required by Firefox. The oldest supported Rust is expected to move forward rapidly when packed_simd can replace the simd crate without performance regression.

Compatibility with rust-encoding

A compatibility layer that implements the rust-encoding API on top of encoding_rs is provided as a separate crate (cannot be uploaded to crates.io). The compatibility layer was originally written with the assumption that Firefox would need it, but it is not currently used in Firefox.

Regenerating Generated Code

To regenerate the generated code:

Roadmap

Release Notes

0.8.13

0.8.12

0.8.11

0.8.10

0.8.9

0.8.8

0.8.7

0.8.6

0.8.5

0.8.4

0.8.3

0.8.2

0.8.1

0.8.0

0.7.2

0.7.1

0.7.0

0.6.11

0.6.10

0.6.9

0.6.8

0.6.7

0.6.6

0.6.5

0.6.4

0.6.3

0.6.2

0.6.1

0.6.0

0.5.1

0.5.0

0.4.0

0.3.2

0.3.1

0.3

0.2.4

0.2.3

0.2.2

0.2.1

0.2.0

The initial release.