scrubcsv: Remove bad lines from a CSV file and normalize the rest

This is a CSV cleaning tool based on BurntSushi's excellent csv library. It's intended to be used for cleaning up and normalizing large data sets before feeding them to other CSV parsers, at the cost of discarding the occasional row. This program may further mangle syntactically-invalid CSV data! See below for details.

Installing and using

To install, first install Rust if you haven't already:

sh curl https://sh.rustup.rs -sSf | sh

Then install scrubcsv using Cargo:

sh cargo install scrubcsv

Run it:

sh $ scrubcsv giant.csv > scrubbed.csv 3000001 rows (1 bad) in 51.58 seconds, 72.23 MiB/sec

For more options, run:

sh scrubcsv --help

Data cleaning notes

We assume that, given hundreds of gigabytes of CSV from many sources, many files will contain a few unparsable lines.

Lines of the following form:

csv Name,Phone "Robert "Bob" Smith",(202) 555-1212

...are invalid according the RFC 4180 because the quotes around "Bob" are not escaped. The creator the file probably intended to write:

csv Name,Phone "Robert ""Bob"" Smith",(202) 555-1212

scrubcsv will currently output this as:

csv Name,Phone "Robert Bob"" Smith""",(202) 555-1212

If the resulting line has the wrong number of columns, it will be discarded. The precise details of cleanup and discarding are subject to change. The goal is to preserve data in valid CSV files, and to make a best effort to salvage or discard records that can't be parsed without being too picky about the details.

Performance notes

This is designed to be relatively fast. For comparison purposes, on particular laptop:

cat /dev/zero | pv > /dev/null shows a throughput of about 5 GB/s.
The original raw output string-writing routines in scrubcsv could reach about 3.5 GB/s.
The csv parser can reach roughly 235 MB/s in zero-copy mode.
With various levels of processing, scrubcsv hits 49 to 125 MB/s.
A lot of old-school C command-line tools hit about 50 to 75 MB/s.

Unfortunately, we can't really use csv's zero-copy mode because we need to see an entire row at once to decide whether or not it's valid before deciding to output it. We could, I suppose, memmove each field as we see it into an existing buffer to avoid malloc overhead (which is almost certianly the bottleneck here), but that would require more code. Still, file an issue if performance is a problem. We could probably make this a maybe two to four times faster (and it would be fun to optimize).