Lance is a cloud-native columnar data format designed for unstructured machine learning datasets, featuring:
Non-goals:
Thanks to its Apache Arrow-first APIs, Lance can be used as a native Arrow extension.
For example, users can analyze a Lance dataset directly with DuckDB
via DuckDB's Arrow integration.
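```python
import lance
import duckdb

# Open the Lance dataset; DuckDB resolves `ds` in the query below
# through its Arrow integration.
ds = lance.dataset("s3://eto-public/datasets/oxfordpet/pet.lance")

# Count rows per label and return the result as an Arrow table.
duckdb.query('select label, count(1) from ds group by label').to_arrow_table()
```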
The machine learning development cycle involves the following steps:
```mermaid
graph LR
    A[Collection] --> B[Exploration];
    B --> C[Analytics];
    C --> D[Feature Engineer];
    D --> E[Training];
    E --> F[Evaluation];
    F --> C;
    E --> G[Deployment];
    G --> H[Monitoring];
    H --> A;
```
People use different data representations at different stages, either for performance or because of the limits of the available tooling. Academia mainly uses XML / JSON for annotations and zipped images/sensor data for deep learning, which are difficult to integrate into data infrastructure and slow to train on over cloud storage. Industry uses data lakes (Parquet-based technologies, e.g., Delta Lake, Iceberg) or data warehouses (AWS Redshift or Google BigQuery) to collect and analyze data, but still has to convert the data into training-friendly formats, such as Rikai/Petastorm or TFRecord. Multiple single-purpose data transforms, as well as syncing copies between cloud storage and local training instances, have become common practice among ML practitioners.
While each of the existing data formats excels at the workload it was originally designed for, we need a new data format tailored to the multistage ML development cycle to reduce friction across tools and data silos.
A comparison of different data formats at each stage of the ML development cycle:

|                     | Lance | Parquet & ORC | JSON & XML | TFRecord | Database | Warehouse |
|---------------------|-------|---------------|------------|----------|----------|-----------|
| Analytics           | Fast  | Fast          | Slow       | Slow     | Decent   | Fast      |
| Feature Engineering | Fast  | Fast          | Decent     | Slow     | Decent   | Good      |
| Training            | Fast  | Decent        | Slow       | Fast     | N/A      | N/A       |
| Exploration         | Fast  | Slow          | Fast       | Slow     | Fast     | Decent    |
| Infra Support       | Rich  | Rich          | Decent     | Limited  | Rich     | Rich      |