Inference

A Rust crate for managing the inference process for machine learning (ML) models. Currently, we support interacting with a Triton Inference Server, loading models from a MinIO Model Store.

Requirements

For Debian-based Linux distros, you can install inference's dependencies with the following command. Docker and the NVIDIA Container Toolkit are excluded because they require special repository configuration, which is covered in their own installation guides:

apt-get install clang build-essential lld protobuf-compiler libprotobuf-dev zstd libzstd-dev make cmake pkg-config libssl-dev

inference is tested on Ubuntu 22.04 LTS. Pull requests that fix Windows or macOS issues are welcome.

Quick Start

  1. Clone repo: git clone https://github.com/opensensordotdev/inference.git
  2. make: Download the latest versions of the Triton Inference Server Protocol Buffer files & Triton sample ML models
  3. docker compose up: Start the MinIO and Triton containers + monitoring infrastructure
  4. Upload the contents of the sample_models directory to the models bucket via the MinIO web UI at localhost:9001
  5. cargo test: Verify all cargo tests pass

Model Inspection

http://localhost:8000/v2/models/simple

Requesting this URL returns the model's name and the parameters required to set up its inputs and outputs.
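The same request can be made from Rust without extra crates. The following is a minimal sketch using only the standard library, not part of this crate: the helper names (model_metadata_path, build_get_request, fetch_model_metadata) are hypothetical, and it assumes Triton's HTTP endpoint is reachable on localhost:8000 as in the URL above. It returns the raw HTTP response (status line, headers, and JSON body) without parsing it.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

/// Builds the model metadata path for a model name (hypothetical helper).
fn model_metadata_path(model: &str) -> String {
    format!("/v2/models/{model}")
}

/// Builds a minimal HTTP/1.1 GET request (hypothetical helper).
/// `Connection: close` asks the server to close the socket after replying,
/// so the response can be read until end of stream.
fn build_get_request(host: &str, port: u16, path: &str) -> String {
    format!(
        "GET {path} HTTP/1.1\r\nHost: {host}:{port}\r\nAccept: application/json\r\nConnection: close\r\n\r\n"
    )
}

/// Sends the request and returns the raw response text.
/// Assumes a Triton server is listening on `host:port`.
fn fetch_model_metadata(host: &str, port: u16, model: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect((host, port))?;
    let request = build_get_request(host, port, &model_metadata_path(model));
    stream.write_all(request.as_bytes())?;

    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}
```

With the containers from the Quick Start running, calling fetch_model_metadata("localhost", 8000, "simple") should return the metadata for the simple sample model. In real code an HTTP client crate and a JSON parser would be a better fit than raw sockets.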

Errata

gRPC Setup

The proto folder contains the Triton protocol buffer files. Only grpc_service.proto is referenced in build.rs, because grpc_service.proto itself includes model_config.proto. The code generated by tonic is in inference.rs.

Multiplexing Tonic Channels

Submitting requests to a gRPC service requires a mutable reference to a Client. This prohibits you from passing a single Client around to multiple Tasks and creates a bottleneck for async code.

Wrapping what amounts to a synchronous resource in a struct and accessing it through async message passing would hide this from users, but it does not fix the core problem: requests still pass through a single Client one at a time.

While it would be possible to make a connection pool of multiple Client<Channel>s and hide this pool in a struct accessed with async message passing, this is complicated.

Storing a tonic::transport::Channel directly in the TritonClient struct doesn't work either, because it requires the struct to implement some obscure internal tonic traits.

The idiomatic way appears to be storing a single master Client in a struct and providing a function that returns a clone of it, since cloning clients is cheap.
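The pattern can be sketched with the standard library alone. This is an illustration under stated assumptions, not this crate's implementation: FakeClient is a hypothetical stand-in for a tonic-generated client (an Arc plays the role of the shared channel), threads stand in for async tasks, and the field and method names are invented for the example.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for a tonic-generated client.
/// Cloning copies only the `Arc` handle, not the underlying connection.
#[derive(Clone)]
struct FakeClient {
    connection: Arc<String>,
}

impl FakeClient {
    /// Takes `&mut self`, mirroring the signature of generated gRPC methods.
    fn infer(&mut self, input: u32) -> String {
        format!("{} <- {}", self.connection, input)
    }
}

/// Holds the single master client and hands out clones on demand.
struct TritonClient {
    master: FakeClient,
}

impl TritonClient {
    fn new(endpoint: &str) -> Self {
        TritonClient {
            master: FakeClient {
                connection: Arc::new(endpoint.to_string()),
            },
        }
    }

    /// Returns a cheap clone that the caller owns and can mutate freely.
    fn client(&self) -> FakeClient {
        self.master.clone()
    }
}

/// Runs `n` requests concurrently; each worker gets its own clone,
/// so no worker needs a mutable reference to the master client.
fn run_concurrent(triton: &TritonClient, n: u32) -> Vec<String> {
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let mut client = triton.client();
            thread::spawn(move || client.infer(i))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Because every worker owns its clone, the `&mut self` requirement no longer forces callers to share one mutable Client. All clones still share the same underlying connection.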

A limitation of this approach could be that gRPC servers usually limit the number of concurrent streams they will multiplex over a single connection (100 is the figure commonly cited). See gRPC performance best practices.

tonic seems to have a default buffer size of 1024 (source: DEFAULT_BUFFER_SIZE in channel/mod.rs).

gRPC load balancing (see GitHub) might be useful eventually if you have multiple Triton pods and want to discover which ones are live and update the endpoint list.

It is not clear whether there is a connection pool under the hood there, or how it is able to connect to multiple servers.

Alternate Implementations/Inspiration

Triton-client-rs

Integration with DALI

DALI (Data Loading Library) and the Triton DALI backend