ETL

This package is a general-purpose Extract-Transform-Load (ETL) library for Rust, built to load arbitrary plain text files into data frame objects.

Features:

* Delimiter specification (comma, tab, etc.)
* Data types:
    * Signed / unsigned integers
    * Floating point numbers
    * Text fields
    * Boolean values
* Transformations:
    * Concatenation (of text fields)
    * Mapping (from one text field to another)
    * Conversion between types
    * Scaling of values (for numeric values, e.g. between -1 and 1)
    * Normalization of values
    * Vectorization (one-hot or feature hashing)
    * Filtering
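To make two of the listed transformations concrete, here is a minimal, library-independent sketch of scaling numeric values into a range such as -1 to 1 and of one-hot vectorization. The function names `scale_to_range` and `one_hot` are hypothetical and are not part of this package's API; the package itself may implement these steps differently.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: linearly rescale `values` so that the minimum maps
/// to `lo` and the maximum maps to `hi`. A constant column has no spread,
/// so every value is mapped to the midpoint of the target range.
fn scale_to_range(values: &[f64], lo: f64, hi: f64) -> Vec<f64> {
    let min = values.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = values.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let span = max - min;
    values
        .iter()
        .map(|&v| {
            if span == 0.0 {
                (lo + hi) / 2.0
            } else {
                lo + (v - min) / span * (hi - lo)
            }
        })
        .collect()
}

/// Hypothetical helper: one-hot encode a categorical column. Each distinct
/// category gets one position (assigned in sorted order), and each input
/// value becomes a vector with a single 1.0 at its category's position.
fn one_hot(values: &[&str]) -> (Vec<String>, Vec<Vec<f64>>) {
    // Collect the distinct categories; BTreeMap keeps them sorted.
    let mut index: BTreeMap<&str, usize> = BTreeMap::new();
    for &v in values {
        index.entry(v).or_insert(0);
    }
    // Assign each category its position in sorted order.
    for (i, slot) in index.values_mut().enumerate() {
        *slot = i;
    }
    let categories = index.keys().map(|k| k.to_string()).collect();
    let rows = values
        .iter()
        .map(|v| {
            let mut row = vec![0.0; index.len()];
            row[index[v]] = 1.0;
            row
        })
        .collect();
    (categories, rows)
}
```

For example, scaling `[0.0, 5.0, 10.0]` into the range -1 to 1 maps the smallest value to -1, the largest to 1, and the middle value to 0, while one-hot encoding a column with two distinct categories yields two-element rows.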

Configuration is handled through a TOML file. For example:

```toml
# data_config.toml

[[source_files]]
name = "source1.csv"
delimiter = ","
fields = [
    { source_name = "a_text_field", field_type = "Text", add_to_frame = false },
    { source_name = "another_text_field", field_type = "Text", add_to_frame = false }
]

[[source_files]]
name = "source2.tsv"
delimiter = "\t"
fields = [
    { source_name = "an_integer", field_type = "Signed" },
    { source_name = "another_integer", field_type = "Signed" },
    { source_name = "a_category", field_type = "Text" },
    { source_name = "an_unused_float", field_type = "Float", add_to_frame = false }
]

[[transforms]]
method = { action = "Concatenate", separator = " & " }
source_fields = [ "a_text_field", "another_text_field" ]
target_name = "a_new_text_field"

[[transforms]]
source_fields = [ "a_category" ]
target_name = "category_mapped_to_integers"

[transforms.method]
action = "Map"
default_value = "-1"
map = { "first_category" = "0", "second_category" = "1" }
```

To load a configuration file:

```rust
use std::path::PathBuf;

// Resolve data_config.toml relative to this source file.
let data_path = PathBuf::from(file!()).parent().unwrap().join("data_config.toml");

let (config, df) = DataFrame::load(data_path.as_path()).unwrap();

// Fields marked `add_to_frame = false` do not appear in the frame;
// the transform targets do.
let mut fieldnames = df.fieldnames();
fieldnames.sort();
assert_eq!(fieldnames, ["a_category", "a_new_text_field", "an_integer", "another_integer",
    "category_mapped_to_integers"]);
```
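As a rough illustration of what the `Concatenate` and `Map` actions in the configuration describe, the sketch below applies both to plain strings. It is standalone Rust and does not use this package; the function names `concatenate` and `map_value` are hypothetical, and the fallback behaviour for unmapped values is an assumption based on the default value setting in the example configuration.

```rust
use std::collections::HashMap;

/// Hypothetical helper: join the values of several text fields from one
/// row with a separator, as a "Concatenate" transform is described to do.
fn concatenate(fields: &[&str], separator: &str) -> String {
    fields.join(separator)
}

/// Hypothetical helper: replace a text value using a lookup table.
/// Values missing from the table fall back to `default_value`
/// (assumed semantics of the "Map" transform's default).
fn map_value(value: &str, map: &HashMap<&str, &str>, default_value: &str) -> String {
    map.get(value).copied().unwrap_or(default_value).to_string()
}
```

With a separator of `" & "`, concatenating `"left"` and `"right"` gives `"left & right"`. With a two-entry map and a default of `"-1"`, any category not present in the map is replaced by `"-1"`.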

Once loaded, a data frame can be converted into a matrix for further processing:

```rust
let (config, df) = DataFrame::load(data_path.as_path()).unwrap();
let (fieldnames, mat) = df.as_matrix().unwrap();
```
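The README does not specify the layout or type of the returned matrix. As a sketch of the general idea only, the following standalone function packs equally long numeric columns into a single row-major buffer, with one matrix column per field. The name `columns_to_row_major` and the row-major layout are assumptions for illustration, not this package's API.

```rust
/// Hypothetical helper: pack equally long numeric columns into a row-major
/// matrix. Returns `(rows, cols, data)`, where `data[r * cols + c]` holds
/// row `r` of column `c`, or `None` if the columns differ in length.
fn columns_to_row_major(columns: &[Vec<f64>]) -> Option<(usize, usize, Vec<f64>)> {
    let cols = columns.len();
    let rows = columns.first().map_or(0, |c| c.len());
    // Every column must contribute exactly one value per row.
    if columns.iter().any(|c| c.len() != rows) {
        return None;
    }
    let mut data = Vec::with_capacity(rows * cols);
    for r in 0..rows {
        for column in columns {
            data.push(column[r]);
        }
    }
    Some((rows, cols, data))
}
```

Keeping the field names in the same order as the columns, as the `(fieldnames, mat)` pair above suggests, lets downstream code identify which matrix column came from which field.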