A global, auto-scaling, preemptive scheduler using work-balancing.
`smolscale` is a work-balancing executor based on [async-task], designed to be a drop-in replacement for `smol` and `async-global-executor`. It is built on the thesis that work-stealing, the usual approach in async executors like `async-executor` and `tokio`, is not the right algorithm for scheduling huge numbers of tiny, interdependent work units, which is what message-passing futures end up being. Instead, `smolscale` uses work-balancing, an approach also found in Erlang, where a global "balancer" thread periodically balances work between workers, but workers do not attempt to steal tasks from each other. This avoids the extremely frequent stealing attempts that work-stealing schedulers generate when applied to async tasks.
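For example, here is a minimal usage sketch, assuming the crate's top-level `spawn` and `block_on` entry points (which mirror `smol`'s):

```rust
fn main() {
    // Run a future to completion, analogous to smol::block_on.
    smolscale::block_on(async {
        // Spawn a task onto the global, auto-scaling executor and await it.
        let task = smolscale::spawn(async { 40 + 2 });
        assert_eq!(task.await, 42);
    });
}
```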
`smolscale`'s approach especially excels in two circumstances:

- **When CPU cores are not fully loaded:** rather than wasting cycles on stealing attempts, `smolscale` drastically reduces CPU usage; an `async-executor` app that takes 80% of CPU time may now take only 20%. Although this does not improve fully-loaded throughput, it significantly reduces power consumption, and it does increase throughput in circumstances where multiple thread pools compete for CPU time.
- **When a lot of message passing is happening:** `smolscale` can significantly improve throughput, especially compared to executors like `async-executor` that do not special-case message passing.

Furthermore, `smolscale` has a preemptive thread pool that ensures that tasks cannot block other tasks, no matter what. This means that you can run expensive computations or even do blocking I/O within a task without worrying about causing deadlocks. Even with "traditional" tasks that do not block, this approach can reduce worst-case latency. Preemption is heavily inspired by Stjepan Glavina's previous work on async-std.
`smolscale` also experimentally includes `Nursery`, a helper for structured concurrency on the `smolscale` global executor.
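`Nursery`'s exact API is not shown here, so the following is a purely illustrative sketch of the structured-concurrency pattern it targets, written with plain `spawn`:

```rust
fn main() {
    smolscale::block_on(async {
        // Spawn a batch of child tasks...
        let children: Vec<_> = (0..4)
            .map(|i| smolscale::spawn(async move { i * i }))
            .collect();

        // ...and await every one of them before the scope ends. Enforcing
        // this "children cannot outlive their parent scope" rule is the
        // essence of structured concurrency; a nursery automates it.
        let mut sum = 0;
        for child in children {
            sum += child.await;
        }
        assert_eq!(sum, 14); // 0 + 1 + 4 + 9
    });
}
```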
Right now, `smolscale` uses a very naive implementation (for example, stealable local queues are implemented as SPSC queues with a spinlock on the consumer side, and worker parking is done naively through `event-listener`), and its performance is expected to improve drastically. However, on most tasks it is already much faster than `async-global-executor` (the de facto standard "non-Tokio-world" executor, which powers `async-std`), sometimes by an order of magnitude. Here are some unscientific benchmark results; percentages are relative to `async-global-executor`:
```
spawn_one                   time:   [105.08 ns 105.21 ns 105.36 ns]
                            change: [-98.570% -98.549% -98.530%] (p = 0.00 < 0.05)
                            Performance has improved.

spawn_many                  time:   [3.0585 ms 3.0598 ms 3.0613 ms]
                            change: [-87.576% -87.291% -86.948%] (p = 0.00 < 0.05)
                            Performance has improved.

yield_now                   time:   [4.1676 ms 4.1917 ms 4.2166 ms]
                            change: [-50.455% -49.994% -49.412%] (p = 0.00 < 0.05)
                            Performance has improved.

ping_pong                   time:   [8.5389 ms 8.6990 ms 8.8525 ms]
                            change: [+12.264% +14.548% +16.917%] (p = 0.00 < 0.05)
                            Performance has regressed.

spawn_executors_recursively time:   [180.26 ms 180.40 ms 180.56 ms]
                            change: [+497.14% +500.08% +502.97%] (p = 0.00 < 0.05)
                            Performance has regressed.

context_switch_quiet        time:   [100.67 us 102.05 us 103.07 us]
                            change: [-42.789% -41.170% -39.490%] (p = 0.00 < 0.05)
                            Performance has improved.

context_switch_busy         time:   [8.7637 ms 8.9012 ms 9.0561 ms]
                            change: [+3.3147% +5.5719% +7.6684%] (p = 0.00 < 0.05)
                            Performance has regressed.
```