This package is a general-purpose Extract-Transform-Load (ETL) library for Rust, built to load arbitrary plain text files into data frame objects.

Features:

* Delimiter specification (comma, tab, etc.)
* Data types:
    * Signed / unsigned integers
    * Floating point numbers
    * Text fields
    * Boolean values
* Transformations:
    * Concatenation (of text fields)
    * Mapping (from one text field to another)
    * Conversion between types
    * Scaling of values (for numeric values, e.g. between -1 and 1)
    * Normalization of values
    * Vectorization (one-hot or feature hashing)
* Filtering

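
To make two of the transformations listed above concrete, here is a minimal sketch of scaling into a range and of one-hot vectorization in plain Rust. The helper names `scale_to_range` and `one_hot` are hypothetical and written only for illustration; they are not this library's API, and the library's own implementation may differ (for instance in how it orders categories or treats constant columns).

```rust
use std::collections::BTreeSet;

/// Linearly rescale `values` so the smallest maps to `lo` and the largest to `hi`.
/// Hypothetical helper for illustration; not part of this library.
fn scale_to_range(values: &[f64], lo: f64, hi: f64) -> Vec<f64> {
    let min = values.iter().copied().fold(f64::INFINITY, f64::min);
    let max = values.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    if values.is_empty() || max == min {
        // No spread to scale (assumption: map everything to the midpoint).
        return vec![(lo + hi) / 2.0; values.len()];
    }
    values
        .iter()
        .map(|v| lo + (v - min) * (hi - lo) / (max - min))
        .collect()
}

/// One-hot encode a text column: one output column per distinct category,
/// with categories in sorted order. Hypothetical helper for illustration.
fn one_hot(values: &[&str]) -> (Vec<String>, Vec<Vec<f64>>) {
    // Collect the distinct categories; BTreeSet yields them sorted.
    let categories: Vec<&str> = values
        .iter()
        .copied()
        .collect::<BTreeSet<_>>()
        .into_iter()
        .collect();
    let rows: Vec<Vec<f64>> = values
        .iter()
        .map(|v| {
            let mut row = vec![0.0; categories.len()];
            // The list is sorted, so binary search finds the column index.
            let col = categories
                .binary_search(v)
                .expect("every value is a known category");
            row[col] = 1.0;
            row
        })
        .collect();
    let names = categories.iter().map(|c| c.to_string()).collect();
    (names, rows)
}
```

With these helpers, `scale_to_range(&[0.0, 5.0, 10.0], -1.0, 1.0)` maps the column onto the interval from -1 to 1, and `one_hot(&["b", "a", "b"])` produces two indicator columns named `a` and `b`.
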
Configuration is handled through a TOML file. For example:

```toml
[[source_files]]
name = "source1.csv"
delimiter = ","
fields = [
    { source_name = "a_text_field", field_type = "Text", add_to_frame = false },
    { source_name = "another_text_field", field_type = "Text", add_to_frame = false }
]

[[source_files]]
name = "source2.tsv"
delimiter = "\t"
fields = [
    { source_name = "an_integer", field_type = "Signed" },
    { source_name = "another_integer", field_type = "Signed" },
    { source_name = "a_category", field_type = "Text" },
    { source_name = "an_unused_float", field_type = "Float", add_to_frame = false }
]

[[transforms]]
method = { action = "Concatenate", separator = " & " }
source_fields = [ "a_text_field", "another_text_field" ]
target_name = "a_new_text_field"

[[transforms]]
source_fields = [ "a_category" ]
target_name = "category_mapped_to_integers"
[transforms.method]
action = "Map"
default_value = "-1"
map = { "first_category" = "0", "second_category" = "1" }
```

To load a configuration file:

```rust
use std::path::PathBuf;

// Locate the configuration file next to this source file.
let data_path = PathBuf::from(file!()).parent().unwrap().join("data_config.toml");
let (config, df) = DataFrame::load(data_path.as_path()).unwrap();
// Fields with `add_to_frame = false` are left out of the frame.
let mut fieldnames = df.fieldnames();
fieldnames.sort();
assert_eq!(fieldnames, ["a_category", "a_new_text_field", "an_integer", "another_integer",
    "category_mapped_to_integers"]);
```
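
The two transforms in the example configuration can be pictured as ordinary string operations. The sketch below is an illustration only: `concatenate` and `map_with_default` are hypothetical helpers, not functions exported by this library. It also assumes that the `Map` action substitutes the configured default for any value missing from the map, which the configuration suggests but this README does not state outright.

```rust
use std::collections::HashMap;

/// Join the text values of several source fields with a separator,
/// as a `Concatenate` transform presumably does for each row.
/// Hypothetical helper for illustration.
fn concatenate(parts: &[&str], separator: &str) -> String {
    parts.join(separator)
}

/// Look `value` up in `map`; unknown values fall back to `default`
/// (assumed behavior of the `Map` action's default value).
/// Hypothetical helper for illustration.
fn map_with_default<'a>(
    value: &str,
    map: &HashMap<&str, &'a str>,
    default: &'a str,
) -> &'a str {
    map.get(value).copied().unwrap_or(default)
}
```

For a row whose two text fields hold `left` and `right`, `concatenate(&["left", "right"], " & ")` yields `left & right`. A category that does not appear in the map comes back as the default, here `"-1"`.
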

Once loaded, the data frame can be converted into a matrix for further processing:

```rust
let (config, df) = DataFrame::load(data_path.as_path()).unwrap();
let (fieldnames, mat) = df.as_matrix().unwrap();
```
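
As a rough picture of what a frame-to-matrix conversion involves, the sketch below flattens equally sized numeric columns into a row-major buffer. It is an assumption-laden illustration: `columns_to_row_major` is a hypothetical helper, and the matrix type, memory layout, and error handling that `as_matrix` actually uses are not described in this README.

```rust
/// Flatten equally sized numeric columns into a row-major buffer.
/// Returns `(rows, cols, data)`, or `None` if the column lengths differ.
/// Hypothetical helper for illustration; not this library's API.
fn columns_to_row_major(columns: &[Vec<f64>]) -> Option<(usize, usize, Vec<f64>)> {
    let cols = columns.len();
    let rows = columns.first().map_or(0, |c| c.len());
    // Every column must contribute exactly one value per row.
    if columns.iter().any(|c| c.len() != rows) {
        return None;
    }
    let mut data = Vec::with_capacity(rows * cols);
    for r in 0..rows {
        for column in columns {
            data.push(column[r]);
        }
    }
    Some((rows, cols, data))
}
```

Each field becomes one matrix column, so the returned field names can be used to tell which column holds which field.
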