lsmdb

github crates.io docs.rs

The lsmdb is a LSM (Log-Structured Merge) Tree storage engine, a disk-based storage engine designed for efficient read and write operations. It utilizes a combination of in-memory and on-disk data structures to provide fast data retrieval and durability.

Public API

StorageEngine

The StorageEngine struct represents the main component of the LSM Tree storage engine. It consists of the following fields:

The StorageEngine struct provides methods for interacting with the storage engine:

DirPath

The DirPath struct represents the directory paths used by the storage engine. It consists of the following fields:

The DirPath struct provides methods for building and retrieving the directory paths.

SizeUnit

The SizeUnit enum represents the unit of measurement for capacity and size. It includes the following variants:

The SizeUnit enum provides a method to_bytes for converting a given value to bytes based on the selected unit.

Helper Functions

The code includes several helper functions:

These helper functions assist in initializing the storage engine, flushing the memtable to an SSTable when it reaches its capacity, and recovering the memtable from the write-ahead log during initialization, ensuring durability and maintaining data consistency.


MemTable

The MemTable (short for memory table) is an in-memory data structure that stores recently written data before it is flushed to disk. It serves as a write buffer and provides fast write operations.

Dependencies

The implementation requires the following dependencies to be imported from the standard library:

Constants

The implementation defines the following constants:

DEFAULT_MEMTABLE_CAPACITY

Represents the default maximum size of the MemTable. By default, it is set to 1 gigabyte (1GB). rs pub(crate) static DEFAULT_MEMTABLE_CAPACITY: usize = SizeUnit::Gigabytes.to_bytes(1);

DEFAULT_FALSE_POSITIVE_RATE

Represents the default false positive rate for the Bloom filter used in the MemTable. By default, it is set to 0.0001 (0.01%).

rs pub(crate) static DEFAULT_FALSE_POSITIVE_RATE: f64 = 0.0001;

Structure

The MemTable structure represents the in-memory data structure and contains the following fields:

rs pub(crate) struct MemTable { entries: Arc<Mutex<BTreeMap<Vec<u8>, Vec<u8>>>>, entry_count: usize, size: usize, capacity: usize, bloom_filter: BloomFilter, size_unit: SizeUnit, false_positive_rate: f64, }

entries

The entries field is of type Arc<Mutex<BTreeMap<Vec<u8>, Vec<u8>>>>. It holds the key-value pairs of the MemTable in sorted order using a BTreeMap. The Arc (Atomic Reference Counting) and Mutex types allow for concurrent access and modification of the entries data structure from multiple threads, ensuring thread safety.

entry_count

The entry_count field is of type usize and represents the number of key-value entries currently stored in the MemTable.

size

The size field is of type usize and represents the current size of the MemTable in bytes. It is updated whenever a new key-value pair is added or removed.

capacity

The capacity field is of type usize and represents the maximum allowed size for the MemTable in bytes. It is used to enforce size limits and trigger flush operations when the MemTable exceeds this capacity.

bloom_filter

The bloom_filter field is of type BloomFilter and is used to probabilistically determine whether a key may exist in the MemTable without accessing the entries map. It helps improve performance by reducing unnecessary lookups in the map.

size_unit

The size_unit field is of type SizeUnit and represents the unit of measurement used for capacity and size calculations. It allows for flexibility in specifying the capacity and size of the MemTable in different units (e.g., bytes, kilobytes, megabytes, etc.).

false_positive_rate

The false_positive_rate field is of type f64 and represents the desired false positive rate for the bloom filter. It determines the trade-off between memory usage and the accuracy of the bloom filter.

Constructor Methods

new

rs pub(crate) fn new() -> Self

The new method creates a new MemTable instance with the default capacity. It internally calls the with_capacity_and_rate method, passing the default capacity and false positive rate.

with_capacity_and_rate

rs pub(crate) fn with_capacity_and_rate( size_unit: SizeUnit, capacity: usize, false_positive_rate: f64, ) -> Self

The with_capacity_and_rate method creates a new MemTable with the specified capacity, size unit, and false positive rate. It initializes the entries field as an empty BTreeMap, sets the entry_count and size to zero, and creates a new BloomFilter with the given capacity and false positive rate. The capacity is converted to bytes based on the specified size unit.

Public Methods

set

rs pub(crate) fn set(&mut self, key: Vec<u8>, value: Vec<u8>) -> io::Result<()>

The set method inserts a new key-value pair into the MemTable. It first acquires a lock on the entries field to ensure thread-safety. If the key is not present in the BloomFilter, it adds the key-value pair to the entries map, updates the entry_count and size, and sets the key in the BloomFilter. If the key already exists, an AlreadyExists error is returned.

get

sh pub(crate) fn get(&self, key: Vec<u8>) -> io::Result<Option<Vec<u8>>>

The get method retrieves the value associated with a given key from the MemTable. It first checks if the key is present in the BloomFilter. If it is, it acquires a lock on the entries field and returns the associated value. If the key is not present in the BloomFilter, it returns None.

remove

sh pub(crate) fn remove(&mut self, key: Vec<u8>) -> io::Result<Option<(Vec<u8>, Vec<u8>)>>

The remove method removes a key-value pair from the MemTable based on a given key. It first checks if the key is present in the BloomFilter. If it is, it acquires a lock on the entries field and removes the key-value pair from the entries map. It updates the entry_count and size accordingly and returns the removed key-value pair as a tuple. If the key is not present in the BloomFilter, it returns None.

clear

rs pub(crate) fn clear(&mut self) -> io::Result<()>

The clear method removes all key-value entries from the MemTable. It acquires a lock on the entries field, clears the entries map, and sets the entry_count and size fields to zero.

entries

rs pub(crate) fn entries(&self) -> io::Result<Vec<(Vec<u8>, Vec<u8>)>>

The entries method returns a vector of all key-value pairs in the MemTable. It acquires a lock on the entries field and iterates over the key-value pairs in the entries map. It clones each key-value pair and collects them into a vector, which is then returned.

Internal Method

capacity

rs pub(crate) fn capacity(&self) -> usize

The capacity method returns the capacity of the MemTable in bytes.

size

rs pub(crate) fn size(&self) -> usize

The size method returns the current size of the MemTable in the specified size unit. It divides the internal size by the number of bytes in one unit of the specified size unit.

false_positive_rate

rs pub(crate) fn false_positive_rate(&self) -> f64

The false_positive_rate method returns the false positive rate of the MemTable.

size_unit

rs pub(crate) fn size_unit(&self) -> SizeUnit

The size_unit method returns the size unit used by the MemTable.

Error Handling

All the methods that involve acquiring a lock on the entries field use the io::Error type to handle potential errors when obtaining the lock. If an error occurs during the locking process, an io::Error instance is created with a corresponding error message.

Thread Safety

The MemTable implementation ensures thread safety by using an Arc<Mutex<BTreeMap<Vec<u8>, Vec<u8>>>> for storing the key-value entries. The Arc allows multiple ownership of the entries map across threads, and the Mutex ensures exclusive access to the map during modification operations, preventing data races.

The locking mechanism employed by the Mutex guarantees that only one thread can modify the entries map at a time, while allowing multiple threads to read from it simultaneously.

Test Suite

A test suite is provided to ensure the correctness and functionality of the MemTable. It includes tests for creating an empty MemTable, setting and getting key-value pairs, and removing key-value entries.


Write-Ahead Log (WAL)

The Sequential Write-Ahead Log (WAL) is a crucial component of the LSM Tree storage engine. It provides durability and atomicity guarantees by logging write operations before they are applied to the main data structure.

When a write operation is received, the key-value pair is first appended to the WAL. In the event of a crash or system failure, the WAL can be replayed to recover the data modifications and bring the MemTable back to a consistent state.

Dependencies

The implementation requires the following dependencies to be imported from the standard library:

WriteAheadLog Structure

The WriteAheadLog structure represents the write-ahead log (WAL) and contains the following field:

rs struct WriteAheadLog { log_file: Arc<Mutex<File>>, }

log_file

The log_file field is of type Arc<Mutex<File>>. It represents the WAL file and provides concurrent access and modification through the use of an Arc (Atomic Reference Counting) and Mutex.

Log File Structure Diagram

The log_file is structured as follows:

sh +-------------------+ | Entry Length | (4 bytes) +-------------------+ | Entry Kind | (1 byte) +-------------------+ | Key Length | (4 bytes) +-------------------+ | Value Length | (4 bytes) +-------------------+ | Key | (variable) | | | | +-------------------+ | Value | (variable) | | | | +-------------------+ | Entry Length | (4 bytes) +-------------------+ | Entry Kind | (1 byte) +-------------------+ | Key Length | (4 bytes) +-------------------+ | Value Length | (4 bytes) +-------------------+ | Key | (variable) | | | | +-------------------+ | Value | (variable) | | | | +-------------------+

Each entry is written sequentially into the log_file using the write_all method, ensuring that the entries are stored contiguously. New entries are appended to the end of the log_file after the existing entries.

Constants

A constant named WAL_FILE_NAME is defined, representing the name of the WAL file.

rs static WAL_FILE_NAME: &str = "lsmdb_wal.bin";

EntryKind

rs enum EntryKind { Insert = 1, Remove = 2, }

The EntryKind enum represents the kind of entry stored in the WAL. It has two variants: Insert and Remove. Each variant is associated with an integer value used for serialization.

WriteAheadLogEntry

rs struct WriteAheadLogEntry { entry_kind: EntryKind, key: Vec<u8>, value: Vec<u8>, }

The WriteAheadLogEntry represents a single entry in the Write-Ahead Log. It contains the following fields:

WriteAheadLogEntry Methods

new

rs fn new(entry_kind: EntryKind, key: Vec<u8>, value: Vec<u8>) -> Self

The new method creates a new instance of the WriteAheadLogEntry struct. It takes the entry_kind, key, and value as parameters and initializes the corresponding fields.

serialize

rs fn serialize(&self) -> Vec<u8>

The serialize method serializes the WriteAheadLogEntry into a vector of bytes. It calculates the length of the entry, then serializes the length, entry kind, key length, value length, key, and value into the vector. The serialized data is returned.

deserialize

rs fn deserialize(serialized_data: &[u8]) -> io::Result<Self>

This method deserializes a WriteAheadLogEntry from the provided serialized data. It performs validation checks on the length and structure of the serialized data and returns an io::Result containing the deserialized entry if successful.

WriteAheadLog Methods

new

rs fn new(directory_path: &PathBuf) -> io::Result<Self>

The new method is a constructor function that creates a new WriteAheadLog instance. It takes a directory_path parameter as a PathBuf representing the directory path where the WAL file will be stored.

If the directory doesn't exist, it creates it. It then opens the log file with read, append, and create options, and initializes the log_file field.

append

rs fn append(&mut self, entry_kind: EntryKind, key: Vec<u8>, value: Vec<u8> ) -> io::Result<()>

The append method appends a new entry to the Write-Ahead Log. It takes an entry_kind parameter of type EntryKind, a key parameter of type Vec<u8>, and a value parameter of type Vec<u8>. The method acquires a lock on the log_file to ensure mutual exclusion when writing to the file.

It creates a WriteAheadLogEntry with the provided parameters, serializes it, and writes the serialized data to the log file.

Finally, it flushes the log file to ensure the data is persisted. If the operation succeeds, Ok(()) is returned; otherwise, an io::Error instance is created and returned.

recover

rs fn recover(&mut self) -> io::Result<Vec<WriteAheadLogEntry>>

The recover method reads and recovers the entries from the Write-Ahead Log. The method acquires a lock on the log_file to ensure exclusive access during the recovery process.

It reads the log file and deserializes the data into a vector of WriteAheadLogEntry instances. It continues reading and deserializing until the end of the log file is reached. The recovered entries are returned as a vector.

clear

rs fn clear(&mut self) -> io::Result<()>

The clear method clears the contents of the WAL file. It acquires a lock on the log_file to ensure exclusive access when truncating and seeking the file. The method sets the length of the file to 0 using the set_len method, effectively truncating it. Then, it seeks to the start of the file using seek with SeekFrom::Start(0) to reset the file pointer. If the operation succeeds, Ok(()) is returned; otherwise, an io::Error instance is created and returned.

Thread Safety

The WriteAheadLog implementation ensures thread safety by using an Arc<Mutex<File>> for the log_file field. The Arc allows multiple ownership of the WAL file across threads, and the Mutex ensures exclusive access to the file during write, recovery, and clear operations, preventing data races.

The locking mechanism employed by the Mutex guarantees that only one thread can modify the WAL file at a time, while allowing multiple threads to read from it simultaneously.

Testing

The module includes a test function that verifies the correctness of the serialization and deserialization process.

It creates a sample WriteAheadLogEntry, serializes it, and then deserializes the serialized data back into an entry. It compares the deserialized entry with the original entry to ensure consistency.


Bloom Filter

The Bloom Filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It provides a fast and memory-efficient way to check for set membership, but it introduces a small probability of false positives.

The Bloom Filter implementation is provided as a Rust module and consists of a struct called BloomFilter. It uses a BitVec to represent the array of bits that make up the filter. The number of hash functions used by the Bloom Filter is configurable, and it keeps track of the number of elements inserted into the filter.

Dependencies

Structure

The BloomFilter struct represents the Bloom Filter data structure and contains the following fields:

rs pub(crate) struct BloomFilter { bits: Arc<Mutex<BitVec>>, num_hashes: usize, num_elements: AtomicUsize, }

bits

An Arc<Mutex<BitVec>> representing the array of bits used to store the Bloom filter.

num_hashes

The number of hash functions used by the Bloom filter.

num_elements

An AtomicUsize representing the number of elements inserted into the Bloom filter.

Constructor Methods

new

rs fn new(num_elements: usize, false_positive_rate: f64) -> Self

The new method creates a new BloomFilter with the specified number of elements and false positive rate. It initializes the Bloom filter's bit array, calculates the number of hash functions, and sets the initial number of elements to zero.

Public Methods

set

rs fn set<T: Hash>(&mut self, key: &T)

The set method inserts a key into the Bloom filter. It calculates the hash value for the key using multiple hash functions and sets the corresponding bits in the bit array to true. It also increments the element count.

contains

rs fn contains<T: Hash>(&self, key: &T) -> bool

The contains method checks if a key is possibly present in the Bloom filter. It calculates the hash value for the key using multiple hash functions and checks the corresponding bits in the bit array. If any of the bits are false, it indicates that the key is definitely not present, and the method returns false. If all bits are true, the method returns true, indicating that the key is possibly present.

num_elements

rs fn num_elements(&self) -> usize

This method returns the current number of elements inserted into the Bloom filter.

Internal Method

calculate_hash

rs fn calculate_hash<T: Hash>(&self, key: &T, seed: usize) -> u64

This function calculates a hash value for a given key and seed. It uses a suitable hash function to hash the key and incorporates the seed value for introducing randomness.

calculate_num_bits

rs fn calculate_num_bits(num_elements: usize, false_positive_rate: f64) -> usize

This function calculates the optimal number of bits for the Bloom filter based on the desired false positive rate and the expected number of elements. It uses a formula to estimate the number of bits required.

calculate_num_hashes

rs fn calculate_num_hashes(num_bits: usize, num_elements: usize) -> usize

This function calculates the optimal number of hash functions for the Bloom filter based on the number of bits and the expected number of elements. It uses a formula to estimate the number of hash functions required.

Usage within MemTable Architecture

In the MemTable architecture pattern, the Bloom Filter is utilized as a component within the MemTable data structure. The MemTable serves as an in-memory structure that holds recently updated key-value pairs before they are flushed to disk. The Bloom Filter is used to optimize data retrieval by quickly determining if a key is likely to be present in the MemTable, thus reducing the need for disk accesses.

When a key-value pair is inserted into the MemTable, the key is also inserted into the Bloom Filter using the set method. During data retrieval, the Bloom Filter's contains method is first used to check if the key is possibly present in the MemTable. If the Bloom Filter indicates that the key is possibly present, further disk accesses can be performed to retrieve the corresponding value. If the Bloom Filter indicates that the key is definitely not present, disk accesses can be avoided, improving overall performance.

It is important to note that the Bloom Filter is a probabilistic data structure and can produce false positives, indicating that a key is possibly present even when it is not. Therefore, if the Bloom Filter indicates that a key is possibly present, a subsequent disk access is still required to validate the actual presence of the key and retrieve the correct value.


SSTable (Sorted String Table)

An SSTable, or Sorted String Table, is an immutable on-disk data structure that stores key-value pairs in a sorted order. It serves as the persistent storage layer for the LSM Tree-based engine. SSTables are typically stored as multiple files, each containing a sorted range of key-value pairs.

When the MemTable reaches a certain threshold size, it is "flushed" to disk as a new SSTable file. The MemTable is atomically replaced with an empty one, allowing new write operations to continue. This process is known as a "memtable flush."

```rs +-----------------------+ | SSTable | +-----------------------+ | - filepath | (PathBuf) | - blocks | (Vec) | - createdat | (DateTime) +-----------------------+ | + new(dir: PathBuf) | -> SSTable | + set(key, value) | -> Result<(), io::Error> | + get(key) | -> Option> | + remove(key) | -> Result<(), io::Error> +-----------------------+

+-----------------------+ | Block | +-----------------------+ | - data | (Vec) | - index | (HashMap>, usize>) | - entrycount | (usize) +-----------------------+ | + new() | -> Block | + isfull(size) | -> bool | + setentry(key, value) | -> Result<(), io::Error> | + removeentry(key) | -> bool | + getvalue(key) | -> Option> | + entrycount() | -> usize +-----------------------+ ```

The SSTable struct represents the Sorted String Table and contains the following fields: - file_path: Stores the path of the SSTable file (PathBuf). - blocks: Represents a collection of blocks that hold the data (Vec). - created_at: Indicates the creation timestamp of the SSTable (DateTime).

The SSTable struct provides the following methods:

The Block struct represents an individual block within the SSTable and contains the following fields:

The Block struct provides the following methods:

Together, the SSTable and Block form the basic components of the SSTable implementation, providing efficient storage and retrieval of key-value pairs with support for adding and removing entries.

The SSTable manages multiple Block instances to store the data, and the Block handles individual block-level operations and indexing.

A diagram illustrating how key-value pairs are stored inside a Block:

rs +----------------------------------+ | Block | +----------------------------------+ | - data: Vec<u8> | // Data entries within the block | - index: HashMap<Arc<Vec<u8>>, usize> | // Index for key-based lookups | - entry_count: usize | // Number of entries in the block +----------------------------------+ | Block Data | | +------------------------+ | | | Entry 1 | | | | +-------------------+ | | | | | Length Prefix | | | | | | (4 bytes, little- | | | | | | endian format) | | | | | +-------------------+ | | | | | Key | | | | | | (variable length) | | | | | +-------------------+ | | | | | Value | | | | | | (variable length) | | | | | +-------------------+ | | | +------------------------+ | | | Entry 2 | | | | ... | | | +------------------------+ | +----------------------------------+

In the diagram: - The Block struct represents an individual block within the SSTable. - The data field of the Block is a vector (Vec<u8>) that stores the data entries. - The index field is a HashMap that maintains the index for efficient key-based lookups. - The entry_count field keeps track of the number of entries in the block.

Each entry within the block consists of three parts: 1. Length Prefix: A 4-byte length prefix in little-endian format, indicating the length of the value. 2. Key: Variable-length key bytes. 3. Value: Variable-length value bytes.

The block's data vector (data) stores these entries sequentially. Each entry follows the format mentioned above, and they are concatenated one after another within the data vector. The index hashmap (index) maintains references to the keys and their corresponding offsets within the data vector.


Author: Nrishinghananda Roy