[[file:fusta.png]]

The virtual files exposed by FUSTA behave like standard flat text files, and provide automatic compatibility with all existing programs. When handling large multiFASTA files, the intrinsic file caching capacities of the OS are leveraged to ensure the best experience to the user.

* Licensing FUSTA is distributed under the CeCILL-C (LGPLv3 compatible) license. Please see the LICENSE file for details. * Installation * Ubuntu

+begin_src

sudo apt install cargo fuse3 libfuse3-dev pkg-config cargo install --git https://github.com/delehef/fusta

+end_src

You can now find =fusta= in =$HOME/cargo/bin/=; you should add this this path to your =$PATH= for easier use. ** Fedora/Rocky Linux/Alma Linux

+begin_src

sudo yum install rust cargo fuse3 fuse3-devel cargo install --git https://github.com/delehef/fusta

+end_src

You can now find =fusta= in =$HOME/cargo/bin/=; you should add this this path to your =$PATH= for easier use. ** Debian First, install the required dependencies:

+begin_src bash

sudo apt install curl fuse3 libfuse3-dev curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh # Debian cargo is outdated

+end_src

then rebot your shell to update the =PATH= environment variable.

Finally, install FUSTA:

+begin_src bash

cargo install --git https://github.com/delehef/fusta

+end_src

You can now find =fusta= in =$HOME/cargo/bin/=; you should add this this path to your =$PATH= for easier use. ** Scientific Linux First, install the required dependencies:

+begin_src bash

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh sudo yum install fuse3 fuse3-devel

+end_src

then rebot your shell to update the =PATH= environment variable.

Finally, install FUSTA:

+begin_src

cargo install --git https://github.com/delehef/fusta

+end_src

You can now find =fusta= in =$HOME/cargo/bin/=; you should add this this path to your =$PATH= for easier use. ** macOS On macOS, you will need to install the build tools if you have not them ready yet: =xcode-select --install=

You must then download and install [[https://osxfuse.github.io/][FUSE for macOS]] in order to be able to use FUSTA.

Finalyy, to build FUSTA, you need to install the [[https://www.rust-lang.org/en-US/install.html][Rust compiler]]. You can then build FUSTA by running =cargo=, the Rust build tool:

+begin_src

cargo install --git https://github.com/delehef/fusta

+end_src

** FreeBSD First, install the required dependencies:

+begin_src bash

sudo pkg install rust pkgconf fusefs-libs # Install build dependencies sudo sysctl vfs.usermount=1 # enable FUSE mounting without requiring administrator permissions sudo kldload fuse # load the FUSE kernel module

+end_src

Then, install FUSTA /per se/:

+begin_src

cargo install --git https://github.com/delehef/fusta

+end_src

You can now find =fusta= in =$HOME/cargo/bin/=; you should add this this path to your =$PATH= for easier use. ** From Sources You should install FUSE (as well as its potential =devel= package), from your package manager – note that a reboot might be necessary for the kernel module to be loaded.

To build FUSTA, you need to install the [[https://www.rust-lang.org/en-US/install.html][Rust compiler]]. You can then build FUSTA by running =cargo=, the Rust build tool:

+begin_src

cargo install --git https://github.com/delehef/fusta

+end_src

You can now find =fusta= in =$HOME/cargo/bin/=; you should add this this path to your =$PATH= for easier use. * Usage ** Quick Start These commands run =fusta= in the background, mount the FASTA file =file.fa= in an automatically created =fusta= folder, exposing all the sequences contained in =file.fa= there. The call to =tree= will display the virtual hierarchy, then =fusermount= is called to cleanly unmount the file.

+begin_src

fusta file.fa tree -h fusta/ fusermount -u fusta

+end_src

** Description Once started, =fusta= will expose the content of a FASTA file in a way that makes it usable by any piece of software using as if it were a set of independent files, detailed as follow.

For instance, here is the virtual hierarchy created by =fusta= after mounting a FASTA file containing /A. thaliana/ genome

+begin_src

fusta ├── append ├── fasta │   ├── 1.fa │   ├── 2.fa │   ├── 3.fa │   ├── 4.fa │   ├── 5.fa │   ├── Mt.fa │   └── Pt.fa ├── get ├── infos.csv ├── infos.txt ├── labels.txt └── seqs ├── 1.seq ├── 2.seq ├── 3.seq ├── 4.seq ├── 5.seq ├── Mt.seq └── Pt.seq

+end_src

* =infos.csv= This read-only CSV file contains a list of all the fragments present in the mounted FASTA file, with, for each of them, the standard =id= and =additional informations= field, plus a third one containing the length of the sequence. =infos.txt= This read-only text file provides the same informations, but in a more human-readable format. =labels.txt= This read-only file contains a list of all the sequence headers present in the mounted FASTA file. =fasta= This folder contains all the individual sequences present in the original FASTA file, exposed as virtually independent read-only FASTA files. =seqs= This folder contains all the individual sequences present in the original FASTA file, exposed as virtually independent read/write files containing only the sequences - without the FASTA headers, but with any newline preserved. These files can be read, copied, removed, edited, etc. as normal files, and any alteration will be reflected on the original FASTA file when fusta is closed. =append= This folder should be used to add new sequences to the mounted FASTA file. Any valid fasta file copied or moved to this directory will be appended to the original FASTA files. It should be noted that the process is completely transparent and the the folder will remain empty, even though the operation is successful. =get= This folder is used for range-access to the sequences in the mounted FASTA file. Although it is empty, any read access to a (non-existing) file following the pattern =SEQID:START-END= will return the corresponding range (0-indexed) in the specified sequence. It should be noted that the access skip headers and newlines, so that the =START-END= coordinates map to actual loci in the corresponding sequence and not to bytes in the mounted FASTA file. * Examples All the following examples assume that a FASTA file has been mounted (/e.g./ =fusta -D genome.fa=), and is unmounted after manipulation (/e.g./ =fusermount -u fusta=). * Get an overview of the file content #+begin_src shell cat fusta/infos.txt #+end_src Extract individual sequences as FASTA files #+beginsrc shell cat fusta/fasta/chr{X,Y}.fa > ~/sexchrs.fa #+endsrc * Extract a range of chromosome 12 #+beginsrc shell cat fusta/get/chr12:12000000-12002000 #+endsrc * Remove sequences from the original file #+beginsrc shell rm fusta/seq/chr{3,5}.seq #+endsrc * Add a new sequence #+beginsrc shell cp moresequences.fa fusta/append #+endsrc * Edit the mitochondrial genome #+beginsrc shell nano fusta/seq/chrMT.seq #+endsrc Batch-rename chromosomes #+begin_src shell cd fusta/seq; for i in *; do mv ${i} chr${i}; done #+end_src ** Use independent sequences in external programs #+beginsrc shell blastn mydb.db -query fusta/fasta/seq25.fa asgart fusta/fasta/chrX.fa fusta/asgart/chrY.fa --out result.json #+endsrc * Compressed FASTA files FUSTA only works with uncompressed (multi)FASTA files. If you wish to use FUSTA on compressed (multi)FASTA files, we recommend to use [[https://github.com/yhoogstrate/fastafs][FASTAFS]] as an intermediary to expose a compressed (multi)FASTA file to FUSTA without requiring to ully uncompress it. * Runtime options

+begin_src

USAGE: fusta [OPTIONS]

OPTIONS: --cache Use either mmap, fseek(2) or memory-backed cache to extract sequences from FASTA files. WARNING: memory caching use as much RAM as the size of the FASTA file should be available. [default: mmap] [possible values: file, mmap, memory] -h, --help Prints help information -C, --max-cache Set the maximum amount of memory to use to cache writes (MB) [default: 500] -o, --mountpoint Specifies the directory to use as mountpoint; it will be created if it does not exist [default: fusta] -D, --no-daemon Do not daemonize -v Sets the level of verbosity -V, --version Prints version information

ARGS: A (multi)FASTA file containing the sequences to moun

+end_src

* =--cache= The cache option is key in adapting FUSTA to your use, and for files of non-trivial size, a correct choice is the difference between a memory overflow and a smooth run: - =file= :: in this mode, FUSTA store all the fragments as offsets in their file, and access them through =fseek= accesses. The performances will probably be the worse, but memory consumption will be kept to the minimal. - =mmap= :: this mode is extremely similar to the previous one, safe that access will proceed through [[https://en.wikipedia.org/wiki/Mmap][mmmap(2)]] reads, leveraging the caching facilities of the OS -- this is the default mode. - =memory= :: in this mode, all fragments will directly be copied to memory. Performances will be at their best, but enough memory should be available to store the entirety of the processed files. * Troubleshooting I get a "Cannot allocate memory" error The FASTA files may be overflowing the default setting of the memory overcommit guard. You may change the overcommiting setting with =sysctl -w vm.overcommit_memory 1=, or use =--cache=file= for less performances, but less virtual memory pressure. I still get a "Cannot allocate memory" error Your FASTA file may contain too many fragments w.r.t. the number of mmap pages that can be mapped by a program. You may increase =maxmapcount= with =sysctl -w vm.maxmapcount 200000=, or use =--cache=file= for less performances, but less virtual memory pressure. I have another error [[https://github.com/delehef/fusta/issues][Open an issue stating your problem!]] * Contact If you have any question or if you encounter a problem, do not hesitate to [[https://github.com/delehef/fusta/issues][open an issue]]. * Acknowledgments FUSTA is standing on the shoulders of, among others, [[https://github.com/cberner/fuser][fuser]], [[https://github.com/clap-rs/clap][clap]], [[https://github.com/RazrFalcon/memmap2-rs][memmap2]] and [[https://github.com/knsd/daemonize][daemonize]]. * Changelog * v1.5.4 - Fix mountpoint not being created * v1.5.3 - Improve notification system * v1.5.2 - Daemonize by default * v1.5.1 - Update memmap to memmap2 * v1.5 - Add an index on ino for better performances at the cost of a bit of memory * v1.4 - Daemonize /after/ parsing FASTA files, so that (i) errors appear immediately and (ii) performances are better when launching multiple instances in parallel. * v1.3 - Can now cache all fragments in memory: increased RAM consumption, but starkly reduced random access time * v1.2.1 - Bugfixes * v1.2 - FUSTA is now based on fuster instead of fuse-rs - Various optimization let FUSTA handle >40GB FASTA files in 6GB of RAM and much better performances - Added an optional notification system behind the =notifications= feature gate * v1.1.1 - Use MMAP by default. While it may lead to unpleasant load when performing heavy operation on very large files, this should be a rather uncommon case. * v1.1 - FUSTA can now directly extract ranges from a sequence ** v1.0 - Initial release