Create a non-duplicated vector of values without extra memory allocations, even for ref values like String
, Vec
, and Box
. Each insertion returns the usize
index of the inserted value. When done, the entire vector can be used.
This approach is useful for creating a vector of unique values, such as a list of unique strings, or a list of unique objects, and then using the index of the value in the vector as a unique identifier, e.g. in a protobuf message.
```rust use dup_indexer::DupIndexer;
fn main() {
// Using String
values
let mut di = DupIndexer::new();
asserteq!(di.insert("hello".tostring()), 0);
asserteq!(di.insert("world".tostring()), 1);
asserteq!(di.insert("hello".tostring()), 0);
asserteq!(di.intovec(), vec!["hello", "world"]);
// Using i32 values
let mut di = DupIndexer::with_capacity(10);
assert_eq!(di.insert(42), 0);
assert_eq!(di.insert(13), 1);
assert_eq!(di.insert(42), 0);
assert_eq!(di[1], 13); // use it as a read-only vector
assert_eq!(di.into_iter().collect::<Vec<_>>(), vec![42, 13]);
} ```
DupIndexer keeps inserted values in a vector in the order of insertion. It also tracks inserted values in a lookup HashMap<T, usize>
where T
is the type of the inserted value. This means that the inserted values must implement Hash
and Eq
.
The value types like ints, floats, bools, chars and any references like &str
cause no issues because they can be copied to both the vector and the lookup map containers. However, the non-copyable types with memory allocation like String
and Vec
cannot be owned by both containers at the same time. To solve this, DupIndexer creates a shallow non-droppable copy of the value, and stores it in the hashmap, whereas the original value goes into the vector:
```rust,ignore
pub struct DupIndexer
pub fn insert(&mut self, value: T) -> usize { let dupvalue = ManuallyDrop::new(unsafe { ptr::read(&value) }); match self.lookup.entry(dupvalue) { Occupied(entry) => *entry.get(), Vacant(entry) => { let index = self.values.len(); entry.insert(index); self.values.push(value); index } } } ```
This way, the hashmap owns the shallow copy, and the vector owns the original value. On subsequent calls, the new value is checked against the hashmap for duplicates. Once finished, the vector with the keys is consumed by the user with .into_vec()
, and the hashmap is dropped without dropping the actual keys.
I believe the above code is safe because the hashmap only keeps the ptr:read
-created copy of the original value while we own it, and the value is never modified. Miri mostly agrees with this, passing all tests except the one where T
is a Box<i32>
. When the test tries to insert a duplicate value, I see the following Miri warning. Do let me know if you know if this is really an issue and how to fix this.
```text ❯ cargo +nightly miri test
--> .../.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:1328:23
|
1328 | PartialEq::eq(&self, &other) | ^^^^^^^ | | | trying to retag from <170684> for SharedReadOnly permission at alloc64636[0x0], but that tag does not exist in the borrow stack for this location | this error occurs as part of retag at alloc64636[0x0..0x4] | = help: this indicates a potential bug in the program: it performed an invalid operation, but the Stacked Borrows rules it violated are still experimental = help: see https://github.com/rust-lang/unsafe-code-guidelines/blob/master/wip/stacked-borrows.md for further information help: <170684> was created by a Unique retag at offsets [0x0..0x4] --> src/lib.rs:49:17 | 49 | entry.insert(index); | ^^^^^^^^^^^^^^^^^^^ help: <170684> was later invalidated at offsets [0x0..0x4] by a Unique retag --> src/lib.rs:50:34 | 50 | self.values.push(value); | ^^^^^ ```
cargo test
.cargo bench -p bench
.cargo +nightly miri test
(note that one test is disabled due to the above issue).git push
will run a few validations first, i.e. fmt
, clippy
, test
, ...