LilLil LogLog!
LilLil LogLog (or just LogLog for short) is (well... will be) a distributed, replicated,
ordered, strongly consistent, (eventually) fault-tolerant binary log in Rust.
Anytime I think about communication patterns in distributed systems,
it hits me that I really, really need:
- replicated log (so the data doesn't go poof)
- strongly and consistently ordered (no spooky non-deterministic behavior)
- with decent performance
- that scales reasonably both up and down (so I can start with my $.50/day k8s cluster,
yet later down the line I don't have to rewrite everything just because it starts
to see some real action)
There's a reason every distributed storage system starts with a replicated log under the hood.
So why are there so few distributed logs that I could use directly?
The closest thing to what I want is LogCabin,
but I want it to be content-oblivious, and I'll try to make it faster. Plus
I'm not touching C++. Another similar option seems to be Apache BookKeeper, but
it's yet another resource- and operationally-heavy Java project that I don't want to work with.
So here is the plan:
- Build the simplest viable distributed log in Rust.
- Start with a single-node version, then add replication and Raft to get HA.
- Design the data model for theoretically good perf.
- Steal all the good ideas I'm aware of from existing projects.
- Try making it scale up as far as possible without sharding, by offloading as much work as possible
to the client.
- The distributed part is mostly about replication (so we don't lose data). Sharding usually
can't guarantee ordering between shards anyway, so it's not as interesting (at least not yet).
- If you need sharding, you've probably outgrown LilLilLogLog - build something on top, or just split
your log, or (fork? and) implement sharding.
- Use byte offsets in the segmented (split into separate files) log as ids (like Kafka); there's a
sketch of resolving an offset to a segment after this list.
- Limit event sizes to roughly what a 3-byte length can express (~16 MiB). Puts some reasonable bounds on latency.
- All writes start by going through the leader, but clients are supposed to upload copies
to the other replicas themselves, to help take the load off the leader.
- On client "append" request, read the fixed header (including length of content) in one
read
,
then use the lenght to copy the content to file's fd
ASAP. Ideally this would be zero-copy,
but seems like tcpsocket->file can't be zero-copy. Since we probably need to load into userspace
buffer, might as well keep the most recent ones for fast response.
- Don't even bother looking at the data. If clients want data integrity, they can append
a checksum themselves, and if they ever spot corrupted data, they can try reading from other
replicas (a small client-side framing sketch follows after this list). If they want multiple
things in one append operation, they can deal with that on their side too.
- Readers will just get whole batches of events. They can read from
any replica, and replicas won't return anything they don't have or that isn't committed yet.
- Seems like tokio-uring has enough functionality to fit these needs, so might as well start with it
(there's a minimal smoke test below).
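To make the "byte offsets as ids" item a bit more concrete, here's a rough sketch of resolving an id to a segment file plus a position inside it. The `Segment`/`Log` types, file names, and sizes are all made up for illustration; none of this is the actual LilLil LogLog data model yet.

```rust
use std::path::PathBuf;

/// One segment file; it remembers the global byte offset of its first byte.
struct Segment {
    base_offset: u64,
    path: PathBuf,
}

/// The whole log is just a list of segments sorted by base_offset.
struct Log {
    segments: Vec<Segment>,
}

impl Log {
    /// Resolve a global byte offset (the record id) to the segment that
    /// contains it and the position inside that segment's file.
    fn locate(&self, id: u64) -> Option<(&Segment, u64)> {
        // Index of the first segment starting *after* id; the record lives
        // in the segment right before it.
        let idx = self.segments.partition_point(|s| s.base_offset <= id);
        let seg = &self.segments[idx.checked_sub(1)?];
        Some((seg, id - seg.base_offset))
    }
}

fn main() {
    let log = Log {
        segments: vec![
            Segment { base_offset: 0, path: "00000000.seg".into() },
            Segment { base_offset: 1 << 20, path: "00100000.seg".into() },
        ],
    };
    if let Some((seg, pos)) = log.locate(1_500_000) {
        println!("id 1500000 -> {:?} at byte {}", seg.path, pos);
    }
}
```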
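And a sketch of the append path itself: read a small fixed header off the socket in one go, pull the content length out of it, then stream exactly that many bytes into the segment file. The 4-byte header, port, and file name are placeholders, and it uses plain blocking std I/O instead of tokio-uring just to show the shape.

```rust
use std::fs::File;
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};

// Hypothetical fixed header: a single 4-byte big-endian content length.
const HEADER_LEN: usize = 4;

fn handle_append(mut conn: TcpStream, log: &mut File) -> std::io::Result<u64> {
    // Read the fixed header in one read.
    let mut header = [0u8; HEADER_LEN];
    conn.read_exact(&mut header)?;
    let content_len = u64::from(u32::from_be_bytes(header));

    // Copy exactly `content_len` bytes from the socket into the log file.
    // std::io::copy goes through a userspace buffer, which is also where a
    // "most recent events" cache could live.
    let copied = std::io::copy(&mut (&mut conn).take(content_len), log)?;
    log.sync_data()?;

    // Ack with how much actually landed in the file.
    conn.write_all(&copied.to_be_bytes())?;
    Ok(copied)
}

fn main() -> std::io::Result<()> {
    let mut log = File::options().create(true).append(true).open("segment-0.seg")?;
    let listener = TcpListener::bind("127.0.0.1:7878")?;
    let (conn, _) = listener.accept()?;
    let n = handle_append(conn, &mut log)?;
    eprintln!("appended {n} bytes");
    Ok(())
}
```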
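Since the server never looks at payloads, integrity checking lives entirely on the client. A possible (made-up) client-side framing, assuming the crc32fast crate:

```rust
// Client-side only: append a CRC32 to the payload before sending, verify it
// after reading. The server stores and returns the framed bytes verbatim.

fn frame(payload: &[u8]) -> Vec<u8> {
    let mut framed = Vec::with_capacity(payload.len() + 4);
    framed.extend_from_slice(payload);
    framed.extend_from_slice(&crc32fast::hash(payload).to_be_bytes());
    framed
}

/// Returns the payload if the checksum matches, or None if the data looks
/// corrupted (time to try another replica).
fn unframe(framed: &[u8]) -> Option<&[u8]> {
    let (payload, crc) = framed.split_at(framed.len().checked_sub(4)?);
    let stored = u32::from_be_bytes(crc.try_into().ok()?);
    (crc32fast::hash(payload) == stored).then_some(payload)
}

fn main() {
    let framed = frame(b"hello loglog");
    assert_eq!(unframe(&framed), Some(&b"hello loglog"[..]));
}
```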
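Finally, a minimal tokio-uring smoke test, adapted from its README example: open a file and do a positional read with the buffer passed by ownership. It's just a sanity check that the basic file API is there, not LilLil LogLog code.

```rust
use tokio_uring::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    tokio_uring::start(async {
        // File name is a placeholder; it just has to exist.
        let file = File::open("segment-0.seg").await?;

        // Buffers are passed by ownership and handed back on completion,
        // since the kernel may still be using them while the op is in flight.
        let buf = vec![0u8; 4096];
        let (res, buf) = file.read_at(buf, 0).await;
        let n = res?;
        println!("read {n} bytes: {:?}", &buf[..n]);

        file.close().await?;
        Ok::<_, Box<dyn std::error::Error>>(())
    })
}
```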
I'm definitely a lillil bit out of my depth here, but I trust the Internet will reliably tell me which of my
ideas are stupid, and worst case I'll just learn some stuff and deliver nothing. 🤷
Read the design doc