Manipulating key-value csv files

January 9, 2018 by Mike Ledger

This post is about a command-line tool for manipulating map-like CSV data, where the first column is treated as a “key” for that row. This post is largely the same as the README available at https://gitlab.com/transportengineering/csvmaps#unions; except that it’ll (extremely slowly) get buried under other posts I make. I also self-plagiarized the same README again on the “project” page for csvmaps, also on this website.

Installation

Currently requires stack because of a dependency on an unreleased (or rather, also not on hackage) package cassava-streaming.

$ git clone git@gitlab.com:transportengineering/csvmaps.git
$ cd csvmaps
$ stack install

Motivation

The use case that motivated it was to manipulate a dictionary of terms from a project that we, Transport Engineering were tasked to clean up, with very little time to actually do it.

The dictionary was simply all the title-case(*) terms that we could automatically pull out of a particular version of the project. This lead to there being thousands of spurious terms (e.g., from words at the beginning of a sentence), and terms that were only partially detected, and a proportional amount of manual work to cull or edit them.

The tool allowed me to:

Delete the same terms deleted in version A in version A+1
For terms that were edited rather than deleted in version A, replace the non-edited version from A+1 (if it existed) with the version from A
Detect entirely new terms
Combine the official project dictionaries with the title-case one I’d produced

(*): It also included some joining words like “and”, “or”, “for”, “for” etc.

Usage

$ csvmaps --help
Usage: csvmaps [-i|--infiles STRING]... [-o|--outfile STRING]
               [--has-header BOOL] [--op MAPOP]
               [--expr MAPEXPR] [--save-labels BOOL]
               [--verbose BOOL]

Available options:
  -h,--help                Show this help text
  -i,--infiles STRING...   Files to combine. The first column of each csv will
                           be used as a key. For non-csv files ( iles that don't
                           end in .csv), each line will be treated as a key.
  -o,--outfile STRING      .csv file to write output to. If omitted, use stdout.
  --has-header BOOL        whether to ignore header in csv files
  --op MAPOP               The operation to use to combine csv files (default:
                           union)
  --expr MAPEXPR
                           The expression to use to combine csv files. Reference
                           the inputs with $N for the Nth document in the
                           "infiles". The operations available are: 
                           1. Union with the + operator; 
                           2. Difference with the - operator; 
                           3. Intersection with the * operator; 
                           4. Union combining all columns with the +. operator; 
                           5. Intersection combining all columns with the *. 
                              operator 
                           6. Labels for expressions with the syntax
                              '"LABEL": EXPR'.

Examples

Take the rows that exist in both v1, in addition (preferring v1) to the rows that exist in both v1 and v2 (preferring v2).

$ csvmaps -i v1.csv -i v2.csv --expr '(($1 *| $2) +| ($1 - $2))'

MAPEXPR syntax

$N: References the Nth document specified, starting at 1.
keys A: Discards the values of all rows.
nulls A: Discards the rows that have non-empty values.
non-nulls A: Discards the rows that have empty values.
A : "label": Assigns a label to the expression A. When used with the --save-labels option, this will also result in a file called label.csv being created using A.
const ["Col_1","Col_2",...] A: Replaces all values of A with the columns given in the first argument.
pad N A: Ensures the number of columns after each key is at least N; fills ones that don’t exist with empty strings.
col N A: Takes the Nth column.
col [N1,N2,...] A: Uses the columns given.

Unions

A + B: Left-biased union of A and B. Same as A + B. When keys exist in both maps, the values at those keys are taken from the left operand.
A |+ B: Left-biased union of A and B. Same as A + B. When keys exist in both maps, the left operand’s values at those keys are used, unless they are empty.
A +| B: Right-biased union of A and B. Same as A +| B. When keys exist in both maps, the right operand’s values at those keys are used, unless they are empty.
A |+| B: Union of A and B that concatenates the values of all keys that both operands have in common.

Intersections

A * B: Left-biased intersection of A and B. Same as A * B. When keys exist in both maps, the values at those keys are taken from the left operand.
A |* B: Left-biased intersection of A and B. Same as A * B. When keys exist in both maps, the left operand’s values at those keys are used, unless they are empty.
A *| B: Right-biased intersection of A and B. Same as A *| B. When keys exist in both maps, the right operand’s values at those keys are used, unless they are empty.
A |*| B: Intersection of A and B that concatenates the values of all keys that both operands have in common.

Difference

A - B: Returns the rows of A whose keys are not in B.