OOXML - Office OpenXML parser in Rust

This crate is started as a private-purposed project with limited knownledge of Office Open XML, use it with caution!

Office Open XML,为由Microsoft开发的一种以XML为基础并以ZIP格式压缩的电子文件规范,支持文件、表格、备忘录、幻灯片等文件格式。

Office Open XML (also informally known as OOXML or Microsoft Open XML (MOX)) is a zipped, XML-based file format developed by Microsoft for representing spreadsheets, charts, presentations and word processing documents. The format was initially standardized by Ecma (as ECMA-376), and by the ISO and IEC (as ISO/IEC 29500) in later versions.

OOXML, as it's naming, is trying to be a pure rust implementation of Office Open XML parser - reading and writing ooxml components efficiently in Rust. But at now, only xlsx parsing is supported.

TLDR;

Example code in examples/xlsx.rs:

```rust use ooxml::document::SpreadsheetDocument;

fn main() { let xlsx = SpreadsheetDocument::open("examples/simple-spreadsheet/data-image-demo.xlsx").unwrap();

let workbook = xlsx.get_workbook();
//println!("{:?}", xlsx);

let _sheet_names = workbook.worksheet_names();

for (sheet_idx, sheet) in workbook.worksheets().iter().enumerate() {
    println!("worksheet {}", sheet_idx);
    println!("worksheet dimension: {:?}", sheet.dimenstion());
    println!("---------DATA---------");
    for rows in sheet.rows() {
        // get cell values
        let cols: Vec<_> = rows
            .into_iter()
            .map(|cell| cell.value().unwrap_or_default())
            .collect();
        println!("{}", itertools::join(&cols, ","));
    }
}

}

```

Run cargo run --example xlsx:

``` worksheet 0 worksheet dimension: Some((1, 1)) ---------DATA---------


worksheet 1 worksheet dimension: Some((4, 4)) ---------DATA--------- name,age,birthday,last edited bob,17,1983/12/12,2020/10/11 19:59 tom,18,1982/12/12,2020/10/11 19:59

cury,20,1980-12-12,2020-10-11 19:59

```

Library Design

The main idea come from the [DotNet OpenXML SDK].

  1. Implement [OpenXML Package Convention] for any OOXML format(docx/xlsx/pptx...), including:
  2. Implement shared OpenXML parts
  3. Implement Excel/SpreadsheetML specifications
  4. Other OpenXML formats(docx, pptx)

Codebase tree structure will be like below.

text src ├── document │ ├── mod.rs │ ├── presentation │ │ └── mod.rs │ ├── spreadsheet │ │ ├── cell.rs │ │ ├── chart.rs │ │ ├── document_type.rs │ │ ├── drawing.rs │ │ ├── media.rs │ │ ├── mod.rs │ │ ├── shared_string.rs │ │ ├── style.rs │ │ ├── workbook.rs │ │ └── worksheet.rs │ └── wordprocessing │ └── mod.rs ├── drawing │ └── mod.rs ├── error.rs ├── lib.rs ├── math │ └── mod.rs └── packaging ├── app_property.rs ├── content_type.rs ├── custom_property.rs ├── element.rs ├── mod.rs ├── namespace.rs ├── package.rs ├── part │ ├── container.rs │ ├── mod.rs │ └── pair.rs ├── property.rs ├── relationship │ ├── mod.rs │ └── reference.rs ├── variant.rs ├── xml.rs └── zip.rs

Definitions For the Crate

The main design principle is typed everything.

The data flows open or create an document will be like below.

```plantuml Document -> Package : open/parse from Package -> Parts : parse to parts Parts -> Components: build components tree Components -> Elements: elements one-to-one map Elements -> Components: elements changes Components -> Parts: components write back Parts -> Package: serialize to package Package <- Document: flush, save or others

Document -> Components: create new document. add or remove components Components <-> Elements: operations Components -> Parts: component add/remove Parts -> Package: serialize to package Document -> Package: flush, save or others ```

Initialize Implementing Features

TODOS: - create marker traits for OpenXML element, make it more generialize. - use minidom in an xml part, tracking the changes and write back to dom tree. - lazy parse some of the openxml part for first start speedup. - implement helper macros for component generation.

Tokei - 2020-11-04-11:35:51

```text

Language Files Lines Code Comments Blanks

Markdown 1 272 0 230 42 Plain Text 1 1 0 1 0 TOML 1 23 21 1 1

XML 52 164 164 0 0

Rust 34 2721 2189 194 338 |- Markdown 14 106 7 90 9

(Total) 2827 2196 284 347

Total 89 3287 2381 516 390

```

Concepts

Office Open XML, or OpenXML

Office Open XML (also informally known as OOXML or Microsoft Open XML (MOX)) is a zipped, XML-based file format developed by Microsoft for representing spreadsheets, charts, presentations and word processing documents. The format was initially standardized by Ecma (as ECMA-376), and by the ISO and IEC (as ISO/IEC 29500) in later versions.

Microsoft Office 2010 provides read support for ECMA-376, read/write support for ISO/IEC 29500 Transitional, and read support for ISO/IEC 29500 Strict. Microsoft Office 2013 and Microsoft Office 2016 additionally support both reading and writing of ISO/IEC 29500 Strict.While Office 2013 and onward have full read/write support for ISO/IEC 29500 Strict, Microsoft has not yet implemented the strict non-transitional, or original standard, as the default file format yet due to remaining interoperability concerns.

OpenXML Package Convention

The Open Packaging Conventions (OPC) is a container-file technology initially created by Microsoft to store a combination of XML and non-XML files that together form a single entity such as an Open XML Paper Specification (OpenXPS) document. OPC-based file formats combine the advantages of leaving the independent file entities embedded in the document intact and resulting in much smaller files compared to normal use of XML.

Standard ECMA-376

[Standard ECMA-376] - The Office Open XML File Formats standard.

1st edition (December 2006), 2nd edition (December 2008), 3rd edition (June 2011), 4th edition (December 2012) and 5th edition (Part 3, December 2015; and Parts 1 & 4, December 2016).

Edition downloads:

Currently is 4th edition, technically aligned with ISO/IEC 29500. 5th edition is ongoing. There is a [Office Open XML Overview] introduction pdf file.

SpreadsheetML

A SpreadsheetML or .xlsx file is a zip file (a package) containing a number of "parts" (typically UTF-8 or UTF-16 encoded) or XML files. The package may also contain other media files such as images. The structure is organized according to the Open Packaging Conventions as outlined in Part 2 of the OOXML standard ECMA-376.

You can look at the file structure and the files that comprise a SpreadsheetML file by simply unzipping the .xlsx file.

text ├── [Content_Types].xml ├── docProps │ ├── app.xml │ ├── core.xml │ └── custom.xml ├── _rels └── xl ├── charts │ ├── chart1.xml │ ├── colors1.xml │ ├── _rels │ │ └── chart1.xml.rels │ └── style1.xml ├── drawings │ ├── drawing1.xml │ ├── drawing2.xml │ └── _rels │ ├── drawing1.xml.rels │ └── drawing2.xml.rels ├── media │ └── image1.png ├── _rels │ └── workbook.xml.rels ├── sharedStrings.xml ├── styles.xml ├── theme │ └── theme1.xml ├── workbook.xml └── worksheets ├── _rels │ ├── sheet1.xml.rels │ └── sheet2.xml.rels ├── sheet1.xml └── sheet2.xml

The number and types of parts will vary based on what is in the spreadsheet, but there will always be a [Content_Types].xml, one or more relationship parts, a workbook part , and at least one worksheet. The core data of the spreadsheet is contained within the worksheet part(s), discussed in more detail at xslx Content Overview.

Resources

  1. Wikipedia Office OpenXML: English, 中文.
  2. Microsoft [DotNet OpenXML SDK] documents and source code.
  3. Wikipedia [OpenXML Package Convention] - [开放打包约定].
  4. What is OOXML: http://officeopenxml.com/
  5. SpreadsheetML: http://officeopenxml.com/anatomyofOOXML-xlsx.php
  6. Rust quick-xml documents.
  7. Rust docx-rs documents and source code on github.
  8. Go Excel file parser excelize.
  9. [Standard ECMA-376].