Lance: A Columnar Data Format for Deep Learning Dataset

Lance is a cloud-native columnar data format designed for unstructured machine learning datasets, featuring:

Non-goals:

How to Use Lance

Thanks to its Apache Arrow-first APIs, Lance can be used as a native Arrow extension. For example, users can analyze a Lance dataset directly with DuckDB through DuckDB's Arrow integration.

```python
# Install the dependencies first:
#   pip install pylance duckdb

import lance
import duckdb

# Understand the label distribution of the Oxford Pet dataset
ds = lance.dataset("s3://eto-public/datasets/oxfordpet/pet.lance")
duckdb.query('select label, count(1) from ds group by label').to_arrow_table()
```

Why

The machine learning development cycle involves the following steps:

```mermaid
graph LR
    A[Collection] --> B[Exploration];
    B --> C[Analytics];
    C --> D[Feature Engineer];
    D --> E[Training];
    E --> F[Evaluation];
    F --> C;
    E --> G[Deployment];
    G --> H[Monitoring];
    H --> A;
```

People use different data representations at different stages, either for performance or because of the limits of the available tooling. Academia mainly uses XML / JSON for annotations and zipped images / sensor data for deep learning, which are difficult to integrate into data infrastructure and slow to train on over cloud storage. Industry uses data lakes (Parquet-based technologies, e.g., Delta Lake, Iceberg) or data warehouses (AWS Redshift or Google BigQuery) to collect and analyze data, but then has to convert the data into training-friendly formats, such as Rikai/Petastorm or TFRecord. Multiple single-purpose data transforms, as well as syncing copies between cloud storage and local training instances, have become common practice among ML practitioners.

While each of the existing data formats excels at the workload it was originally designed for, we need a new data format tailored to the multistage ML development cycle to reduce the friction between tools and data silos.

A comparison of different data formats at each stage of the ML development cycle:

|                     | Lance | Parquet & ORC | JSON & XML | TFRecord | Database | Warehouse |
|---------------------|-------|---------------|------------|----------|----------|-----------|
| Analytics           | Fast  | Fast          | Slow       | Slow     | Decent   | Fast      |
| Feature Engineering | Fast  | Fast          | Decent     | Slow     | Decent   | Good      |
| Training            | Fast  | Decent        | Slow       | Fast     | N/A      | N/A       |
| Exploration         | Fast  | Slow          | Fast       | Slow     | Fast     | Decent    |
| Infra Support       | Rich  | Rich          | Decent     | Limited  | Rich     | Rich      |

Presentations and Talks