pithy 0.1.0 - an absurdly fast, strangely accurate summariser

Note that pithy is more of a highlighter than a summariser: it picks out the most important sentences in a text, and those often happen to make a good summary. The experimental --density flag adjusts which sentences it favours.

Here are some examples of what it outputs:

https://plato.stanford.edu/entries/chinese-room/, The Chinese Room Argument

https://www.gutenberg.org/files/55/55-0.txt, The Wonderful Wizard of Oz

https://archive.org/stream/ProgrammingRust1stEdition1491927283/Programming%20Rust%201st%20Edition%201491927283_djvu.txt, "Programming Rust 1st Edition"

https://www.gutenberg.org/cache/epub/5827/pg5827.txt, The Problems of Philosophy by Bertrand Russell

Quick example: pithy -f your_file_here.txt --sentences 4
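
To try the quick example on one of the texts listed above, something like the following should work. This is only a sketch: it assumes curl and pithy are both on your PATH, and fetch_and_summarise and oz.txt are names made up for this example, not part of pithy.

```shell
# Hypothetical helper: download a plain-text file, then summarise it.
# Only -f and --sentences are used, exactly as in the quick example.
fetch_and_summarise() {
    # $1 = URL of a plain-text file
    # $2 = local filename to save it under
    # $3 = number of sentences (falls back to 3, pithy's documented default)
    curl -sSL -o "$2" "$1" || return 1
    pithy -f "$2" --sentences "${3:-3}"
}

# Example call, using the Project Gutenberg text listed above:
# fetch_and_summarise "https://www.gutenberg.org/files/55/55-0.txt" oz.txt 4
```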

--help:

Print this help message

-f:

The file pithy will read from. Required.

--sentences:

The number of sentences for pithy to return. Defaults to 3.

--approximate:

Will return a decent approximation of the summary. Good
for extremely long texts where you don't care about precision.

--bias:

Slash-separated (i.e. "/") list of words to bias the summary towards.
If you are using pithy on a large text, increase --chunk_size to
2500-5000 to get relevant results. Note that biasing doesn't work in
approximate mode.

--bias_strength:

The strength of the bias, must be an integer. Defaults to 6.

--by_section:

If set, pithy splits the text into sections, and each section is
summarized separately. Defaults to false.

--chunk_size:

The number of sentences to read at a time. Defaults to 500 
if unspecified.

--force_all:

If set, pithy reads the text all at once. Can be quite 
slow once you go past the 7k mark. Defaults to false.

--force_chunk:

If set, regardless of how large the text is, pithy splits it
into chunks. Should be used in combination with chunk_size 
and by_section.

--ngrams:

If set, pithy uses ngrams rather than words. 
It's usually crap, but you might use it as a last resort 
for non-spaced languages that you can't pre-tokenise. 
Defaults to false.

--min_length:

The minimum sentence length before filtering. Defaults to 30.

--max_length:

The maximum sentence length before filtering. Defaults to 1500.

--separator:

The separator used to split the text into sentences.
Defaults to '. '. Pass the word newline to split on newlines instead.

--clean_whitespace:

If set, removes sentences with excessive whitespace. Useful for 
pdfs and copy-pastes from websites.

--clean_nonalphabetic:

If set, removes sentences with too many non-alphabetic characters.

--clean_caps:

If set, removes sentences with too many capital letters. Useful 
if the text contains a lot of references or indices.

--length_penalty:

The length penalty. Defaults to 1.5. Decrease to make pithy go for longer
sentences, increase for shorter sentences.

--density:

Experimental setting. Defaults to 3. Setting it lower
seems to bias pithy's summaries towards more common words;
setting it higher seems to bias them towards rarer
but more informative words.

--no_context:

If set, the context surrounding sentences isn't provided. 
Defaults to false.

--relevance:

If set, the sentences are sorted by their relevance rather 
than their order in the original text. Defaults to false.

--nobar:

If set, the progress bar is not printed. Defaults to false because
progress bars are cool.
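
As an illustration of combining the flags above, here is a sketch of a biased summary of a large text. It assumes pithy is on your PATH and that valued flags are passed the same way as in the quick example. The bias words, the values 5, 8 and 3000, and the summarise_with_bias helper are arbitrary choices for this example.

```shell
# --bias expects a slash-separated list, so join the words with "/".
bias=$(printf '%s\n' ownership borrowing lifetimes | paste -s -d / -)

# Hypothetical helper: biased summary of a large text.
summarise_with_bias() {
    # $1 = input file, $2 = slash-separated bias words
    # --chunk_size 3000 sits in the 2500-5000 range suggested for large texts.
    # --bias_strength must be an integer; 8 is a little above the default of 6.
    # --approximate is left out because biasing doesn't work in that mode.
    pithy -f "$1" --sentences 5 --bias "$2" --bias_strength 8 --chunk_size 3000
}

# Example call:
# summarise_with_bias programming_rust.txt "$bias"
```

If the results drift off topic, raising --bias_strength or --chunk_size would be the first things to try.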