Rust library for OEM Code pages


This library handles many SBCS (single-byte character sets) that are used as OEM code pages. OEM code pages are still used on Windows today to encode file names in ZIP archives and characters in the terminal.

Supported code pages

| Code Page | Note                                             |
| --------- | ------------------------------------------------ |
| 437       | OEM United States                                |
| 720       | Arabic (Transparent ASMO); Arabic (DOS)          |
| 737       | OEM Greek (formerly 437G); Greek (DOS)           |
| 775       | OEM Baltic; Baltic (DOS)                         |
| 850       | OEM Multilingual Latin 1; Western European (DOS) |
| 852       | OEM Latin 2; Central European (DOS)              |
| 855       | OEM Cyrillic (primarily Russian)                 |
| 857       | OEM Turkish; Turkish (DOS)                       |
| 858       | OEM Multilingual Latin 1 + Euro symbol           |
| 860       | OEM Portuguese; Portuguese (DOS)                 |
| 861       | OEM Icelandic; Icelandic (DOS)                   |
| 862       | OEM Hebrew; Hebrew (DOS)                         |
| 863       | OEM French Canadian; French Canadian (DOS)       |
| 864       | OEM Arabic; Arabic (864)                         |
| 865       | OEM Nordic; Nordic (DOS)                         |
| 866       | OEM Russian; Cyrillic (DOS)                      |
| 869       | OEM Modern Greek; Greek, Modern (DOS)            |
| 874       | ANSI/OEM Thai (ISO 8859-11); Thai (Windows)      |

Notes are quoted from https://docs.microsoft.com/en-us/windows/win32/intl/code-page-identifiers

How to use

Add oem_cp to the dependencies in your project's Cargo.toml.

```toml
[dependencies]
# *snip*
oem_cp = "1"
# *snip*
```

Examples

Use specific code pages

Encoding Unicode string to SBCS bytes

```rust
use oem_cp::{encode_string_checked, encode_string_lossy};
use oem_cp::code_table::{ENCODING_TABLE_CP437, ENCODING_TABLE_CP737};

assert_eq!(encode_string_checked("π≈22/7", &*ENCODING_TABLE_CP437), Some(vec![0xE3, 0xF7, 0x32, 0x32, 0x2F, 0x37]));
// Archimedes in Greek
assert_eq!(encode_string_checked("Αρχιμήδης", &*ENCODING_TABLE_CP737), Some(vec![0x80, 0xA8, 0xAE, 0xA0, 0xA3, 0xE3, 0x9B, 0x9E, 0xAA]));
// ¾ (U+00BE) is not included in CP437
assert_eq!(encode_string_checked("½+¼=¾", &*ENCODING_TABLE_CP437), None);
// Unknown characters can be replaced with ? (0x3F)
assert_eq!(encode_string_lossy("½+¼=¾", &*ENCODING_TABLE_CP437), vec![0xAB, 0x2B, 0xAC, 0x3D, 0x3F]);
```
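The difference between the checked and lossy variants can be pictured with a minimal, std-only sketch. It does not use oem_cp's tables or reproduce its implementation: the two-entry table and the names `sketch_encoding_table`, `sketch_encode_checked`, and `sketch_encode_lossy` are hypothetical and exist only for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical table holding only two CP437 mappings from the example above.
fn sketch_encoding_table() -> HashMap<char, u8> {
    HashMap::from([('½', 0xAB), ('¼', 0xAC)])
}

/// Hypothetical helper: yields None as soon as one character has no mapping.
fn sketch_encode_checked(src: &str, table: &HashMap<char, u8>) -> Option<Vec<u8>> {
    src.chars()
        .map(|c| {
            if c.is_ascii() {
                Some(c as u8) // ASCII characters map to themselves
            } else {
                table.get(&c).copied()
            }
        })
        .collect()
}

/// Hypothetical helper: substitutes ? (0x3F) for every unmapped character.
fn sketch_encode_lossy(src: &str, table: &HashMap<char, u8>) -> Vec<u8> {
    src.chars()
        .map(|c| {
            if c.is_ascii() {
                c as u8
            } else {
                table.get(&c).copied().unwrap_or(b'?')
            }
        })
        .collect()
}

fn main() {
    let table = sketch_encoding_table();
    println!("{:?}", sketch_encode_checked("½+¼=¾", &table));
    println!("{:?}", sketch_encode_lossy("½+¼=¾", &table));
}
```

The checked form is the safer choice when a wrong byte would corrupt data (for example a file name), while the lossy form suits display-only output.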

Decoding SBCS bytes to Unicode string

```rust
use oem_cp::{decode_string_complete_table, decode_string_incomplete_table_checked, decode_string_incomplete_table_lossy};
use oem_cp::code_table::{DECODING_TABLE_CP437, DECODING_TABLE_CP874};

assert_eq!(&decode_string_complete_table(vec![0xFB, 0xAC, 0x3D, 0xAB], &DECODING_TABLE_CP437), "√¼=½");

// For encodings that have some undefined code points, you must use
// decode_string_incomplete_table_{checked,lossy} instead of decode_string_complete_table

// กุ้ง means shrimp in Thai (U+0E49 => 0xE9)
assert_eq!(decode_string_incomplete_table_checked(vec![0xA1, 0xD8, 0xE9, 0xA7], &DECODING_TABLE_CP874), Some("กุ้ง".to_string()));
// 0xDB-0xDE and 0xFC-0xFF are undefined in CP874 on Windows
assert_eq!(decode_string_incomplete_table_checked(vec![0x30, 0xDB], &DECODING_TABLE_CP874), None);
// You can use decode_string_incomplete_table_lossy instead
assert_eq!(&decode_string_incomplete_table_lossy(vec![0xA1, 0xD8, 0xE9, 0xA7], &DECODING_TABLE_CP874), "กุ้ง");
// Undefined code points are replaced with U+FFFD (replacement character)
assert_eq!(&decode_string_incomplete_table_lossy(vec![0x30, 0xDB], &DECODING_TABLE_CP874), "0\u{FFFD}");
```
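The idea of an incomplete table can be sketched with the standard library alone. The sketch assumes that bytes below 0x80 are plain ASCII, that bytes 0x80-0xFF index a 128-entry table, and that an undefined byte is modelled as `None`. The names `sketch_decoding_table`, `sketch_decode_checked`, and `sketch_decode_lossy` are hypothetical, and the table holds only the four CP874 entries used in the example above; it is not the crate's implementation.

```rust
/// Hypothetical 128-entry table for bytes 0x80..=0xFF; None marks an undefined byte.
fn sketch_decoding_table() -> [Option<char>; 128] {
    let mut table = [None; 128];
    // Four CP874 entries from the example above (index = byte - 0x80).
    table[0xA1 - 0x80] = Some('\u{0E01}'); // THAI CHARACTER KO KAI
    table[0xD8 - 0x80] = Some('\u{0E38}'); // THAI CHARACTER SARA U
    table[0xE9 - 0x80] = Some('\u{0E49}'); // THAI CHARACTER MAI THO
    table[0xA7 - 0x80] = Some('\u{0E07}'); // THAI CHARACTER NGO NGU
    table
}

/// Hypothetical helper: yields None if any byte is undefined in the table.
fn sketch_decode_checked(src: &[u8], table: &[Option<char>; 128]) -> Option<String> {
    src.iter()
        .map(|&b| {
            if b < 0x80 {
                Some(b as char) // ASCII range maps to itself
            } else {
                table[(b - 0x80) as usize]
            }
        })
        .collect()
}

/// Hypothetical helper: replaces undefined bytes with U+FFFD.
fn sketch_decode_lossy(src: &[u8], table: &[Option<char>; 128]) -> String {
    src.iter()
        .map(|&b| {
            if b < 0x80 {
                b as char
            } else {
                table[(b - 0x80) as usize].unwrap_or('\u{FFFD}')
            }
        })
        .collect()
}

fn main() {
    let table = sketch_decoding_table();
    println!("{:?}", sketch_decode_checked(&[0xA1, 0xD8, 0xE9, 0xA7], &table));
    println!("{:?}", sketch_decode_lossy(&[0x30, 0xDB], &table));
}
```

A complete table would need no `Option` at all, which is why a separate, infallible function can exist for code pages that define every byte.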

Select appropriate codepage from integer

```rust
use oem_cp::code_table::{ENCODING_TABLE_CP_MAP, DECODING_TABLE_CP_MAP};
use oem_cp::{encode_string_checked, encode_string_lossy};

if let Some(cp874_table) = (*DECODING_TABLE_CP_MAP).get(&874) {
    assert_eq!(cp874_table.decode_string_checked(vec![0xA1, 0xD8, 0xE9, 0xA7]), Some("กุ้ง".to_string()));
    // undefined mapping 0xDB for CP874
    assert_eq!(cp874_table.decode_string_checked(vec![0xDB]), None);
    assert_eq!(&cp874_table.decode_string_lossy(vec![0xDB]), "\u{FFFD}");
} else {
    panic!("Why the hell CP874 isn't registered?");
}

if let Some(cp437_table) = (*ENCODING_TABLE_CP_MAP).get(&437) {
    assert_eq!(encode_string_checked("π≈22/7", cp437_table), Some(vec![0xE3, 0xF7, 0x32, 0x32, 0x2F, 0x37]));
    // ¾ is undefined in CP437
    assert_eq!(encode_string_checked("½+¼=¾", cp437_table), None);
    // It's replaced with ? (0x3F)
    assert_eq!(encode_string_lossy("½+¼=¾", cp437_table), vec![0xAB, 0x2B, 0xAC, 0x3D, 0x3F]);
} else {
    panic!("Why the hell CP437 isn't registered?");
}
```

Support for ANSI/EBCDIC/MBCS code pages

For ANSI (125x) and MBCS (932-950; for CJK languages) code pages, please use encoding_rs instead.

This library is only for extended ASCII encodings (0x00-0x7F must be compatible with ASCII), so EBCDIC encodings will never be supported.

Symbols from 0x01 to 0x1F

This library doesn't support the symbols that CP437 assigns to 0x01-0x1F. These bytes are mapped to the control characters U+0001-U+001F instead. If you prefer the symbols, use codepage_437 instead.
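The two interpretations of a low byte can be contrasted in a small std-only sketch. The function names are hypothetical, and the symbol table lists only the first three CP437 glyphs (☺ U+263A, ☻ U+263B, ♥ U+2665); it reflects neither the API of this library nor that of codepage_437.

```rust
/// Control-code interpretation: the byte value is used as the code point.
fn sketch_as_control(byte: u8) -> char {
    byte as char
}

/// Symbol interpretation, covering only bytes 0x01..=0x03 for illustration.
fn sketch_as_symbol(byte: u8) -> Option<char> {
    match byte {
        0x01 => Some('\u{263A}'), // WHITE SMILING FACE
        0x02 => Some('\u{263B}'), // BLACK SMILING FACE
        0x03 => Some('\u{2665}'), // BLACK HEART SUIT
        _ => None,                // not covered by this sketch
    }
}

fn main() {
    for byte in 0x01u8..=0x03 {
        println!(
            "0x{:02X}: control U+{:04X}, symbol {:?}",
            byte,
            sketch_as_control(byte) as u32,
            sketch_as_symbol(byte)
        );
    }
}
```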

Licenses

MIT