
Project Status: Active – The project has reached a stable, usable state and is being actively developed.

STAM Tools

A collection of command-line tools for working with STAM.

Various tools are grouped under the stam tool and invoked with a subcommand.

For many of these, you can set --verbose for extra details in the output.

Installation

From source

$ cargo install stam-tools

Usage

Add the --help flag after the subcommand for extensive usage instructions.

Most tools take as input a STAM JSON file containing an annotation store. Any files mentioned via the @include mechanism are loaded automatically.

Instead of passing STAM JSON files, you can read from stdin and/or output to stdout by setting the filename to -; this works in many places.
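As a quick sketch of the stdin convention (the store file name here is an assumption):

```shell
# Sketch: `-` makes the tool read the annotation store from stdin
# (store.stam.json is an assumed file name)
cat store.stam.json | stam export -
```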

These tools also support reading and writing STAM CSV.

Tools

stam export

The stam export tool is used to export STAM data into a tabular data format (TSV, tab-separated values). You can configure precisely which columns you want to export using the --columns parameter. See stam export --help for a list of supported columns.

One of the more powerful features is that you can specify custom columns by giving a set ID, a delimiter and a key ID (the delimiter is a slash by default), for instance: my_set/part_of_speech. This will then output the corresponding value in that column, if it exists.
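For instance, an invocation might look as follows (this is a sketch: the store file name and the built-in column names are assumptions, check stam export --help for the actual list):

```shell
# Sketch: export offset columns plus a custom my_set/part_of_speech column
# (store.stam.json and the built-in column names are assumptions)
stam export --columns Text,TextResource,BeginOffset,EndOffset,my_set/part_of_speech store.stam.json
```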

This export function is not lossless; that is, it cannot encode everything that STAM supports, unlike STAM JSON and STAM CSV. It does, however, give you a great deal of flexibility to quickly output only the data relevant to your specific purpose.

stam import

The stam import tool is used to import tabular data from a TSV (Tab Separated Values) file into STAM. Like stam export, you can configure precisely what columns you want to import, using the --columns parameter. By default, the import function will attempt to parse the first line of your TSV file as the header and use that to figure out the column configuration. You will often want to set --annotationset to set a default annotation set to use for custom columns. If you set --annotationset my_set then a column like part_of_speech will be interpreted in that set (same as if you wrote my_set/part_of_speech explicitly).

Here is a simple example of a possible import TSV file (with --annotationset my_set):

```tsv
Text	TextResource	BeginOffset	EndOffset	part_of_speech
Hello	hello.txt	0	5	interjection
world	hello.txt	6	10	noun
```
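A matching invocation might look like this (a sketch: the file names and the argument order are assumptions, check stam import --help):

```shell
# Sketch: import the TSV into an annotation store, with my_set as the
# default set for custom columns (file names/argument order are assumptions)
stam import --annotationset my_set store.stam.json data.tsv
```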

The import function has some special abilities. If your TSV data does not mention specific offsets in a text resource, they will be looked up automatically during the import procedure. If the text resources don't exist in the first place, they can even be reconstructed (within certain constraints; the output text will likely be in tokenised form only). If your data does not explicitly reference a resource, use the --resource parameter to point to an existing resource that will act as a default, or --new-resource for the reconstruction behaviour.

By setting --resource hello.txt or --new-resource hello.txt you can import the following much more minimal TSV:

```tsv
Text	part_of_speech
Hello	interjection
world	noun
```
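An invocation using the reconstruction behaviour might look like this (a sketch: the file names and argument order are assumptions, check stam import --help):

```shell
# Sketch: reconstruct hello.txt from the minimal TSV while importing
# (file names and argument order are assumptions)
stam import --new-resource hello.txt --annotationset my_set store.stam.json minimal.tsv
```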

The importer supports empty lines within the TSV file. When reconstructing text, these will map to (typically) a newline in the to-be-constructed text (this is configurable with --outputdelimiter2). Likewise, the delimiter between rows is configurable with --outputdelimiter and defaults to a space.
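The default delimiter behaviour can be illustrated with a rough stand-in using standard tools (this is not the actual stam import implementation): join the Text column of consecutive rows with a space, and turn empty lines into newlines:

```shell
# Rough illustration of the default delimiters only (space between rows,
# newline for empty lines); not the actual stam import implementation.
printf 'Hello\tinterjection\nworld\tnoun\n\nBye\tinterjection\n' |
  awk -F'\t' '
    NF == 0 { printf "\n"; sep = ""; next }  # empty line -> newline
    { printf "%s%s", sep, $1; sep = " " }    # row -> Text column, space-joined
    END { printf "\n" }
  '
```

This prints "Hello world" and "Bye" on separate lines, mirroring how the reconstructed text would be delimited.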

Note that stam import cannot import everything stam export can export. It can only import rows exported with --type Annotation (the default), in which each row corresponds to one annotation.

stam tag

The stam tag tool matches regular expressions in text and subsequently associates annotations with the matches. It can be used, for example, for tokenisation or other tagging tasks.

The stam tag command takes a TSV file (example) containing regular expression rules for the tagger. The file contains the following columns:

  1. The regular expression, following this syntax. The expression may contain one or more capture groups marking the items that will be tagged; in that case, anything else is considered context and will not be tagged.
  2. The ID of the annotation data set.
  3. The ID of the data key.
  4. The value to set. If this follows the syntax $1, $2, etc., it will be set to the value of that capture group (1-indexed).

Example:

```tsv
EXPRESSION	#ANNOTATIONSET	#DATAKEY	#DATAVALUE
\w+(?:[-_]\w+)*	simpletokens	type	word
[.\?,/]+	simpletokens	type	punctuation
[0-9]+(?:[,.][0-9]+)	simpletokens	type	number
```
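Putting it together, an invocation might look like this (a sketch: the --rules flag name and the file names are assumptions, check stam tag --help):

```shell
# Sketch: apply the tagging rules above to an annotation store
# (flag name and file names are assumptions)
stam tag --rules rules.tsv store.stam.json
```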