Uses rayon
for parallelism and falls back on unstable_sort
for short slices.
Only works on power-of-two arrays for now.
Benchmarks on 4-core (8 threads) Kaby Lake 3.8GHz laptop:
``` running 9 tests test stdbitonic128 ... bench: 1,211 ns/iter (+/- 73) test stdbitonic32768 ... bench: 630,865 ns/iter (+/- 90,803) test stdbitonic65536 ... bench: 1,373,111 ns/iter (+/- 78,431)
test stdstable128 ... bench: 1,721 ns/iter (+/- 109) test stdstable32768 ... bench: 1,234,859 ns/iter (+/- 150,314) test stdstable65536 ... bench: 2,603,823 ns/iter (+/- 151,850)
test stdunstable128 ... bench: 1,211 ns/iter (+/- 184) test stdunstable32768 ... bench: 878,739 ns/iter (+/- 51,668) test stdunstable65536 ... bench: 1,721,517 ns/iter (+/- 127,620) ```