usage: tacl excise [-h] [-v] [-t {cbeta,latin,pagel}]
                   NGRAMS REPLACEMENT OUTPUT CORPUS WORK [WORK ...]

Output witness files for each specified work with all of the specified n-grams
replaced with the supplied replacement text. The replacement is done for each
n-gram in turn, in descending order of n-gram length.

positional arguments:
  NGRAMS                Path to file containing n-grams (one per line) to be
                        replaced.
  REPLACEMENT           Text to replace n-grams with. This should be one or
                        more valid tokens.
  OUTPUT                Path to directory to output transformed files to.
  CORPUS                Path to corpus.
  WORK                  Work whose witnesses will be transformed.

options:
  -h, --help            show this help message and exit
  -v, --verbose         Display debug information; multiple -v options
                        increase the verbosity.
  -t {cbeta,latin,pagel}, --tokenizer {cbeta,latin,pagel}
                        Type of tokenizer to use. The "cbeta" tokenizer is
                        suitable for the Chinese CBETA corpus (tokens are
                        single characters or workaround clusters within square
                        brackets). The "pagel" tokenizer is for use with the
                        transliterated Tibetan corpus (tokens are sets of word
                        characters plus some punctuation used to transliterate
                        characters).
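
The replacement order described above (each n-gram in turn, longest first) can be illustrated with a minimal Python sketch. This is not tacl's implementation: the token pattern is only an assumed approximation of the "cbeta" tokenizer (a bracketed cluster or a single word character), and the names CBETA_TOKEN, tokenize, and excise are hypothetical. For brevity the sketch also drops any characters that are not tokens, such as whitespace and punctuation.

```python
import re

# Assumed approximation of the "cbeta" tokenizer: a workaround cluster in
# square brackets, or a single word character. Not taken from tacl itself.
CBETA_TOKEN = re.compile(r"\[[^\]]*\]|\w")


def tokenize(text):
    """Split text into a list of tokens (hypothetical helper)."""
    return CBETA_TOKEN.findall(text)


def excise(text, ngrams, replacement):
    """Replace every occurrence of each n-gram, longest n-gram first."""
    tokens = tokenize(text)
    replacement_tokens = tokenize(replacement)
    # Sort by token count, descending, so that a short n-gram cannot break
    # up an occurrence of a longer n-gram that contains it.
    ordered = sorted((tokenize(n) for n in ngrams), key=len, reverse=True)
    for ngram in ordered:
        if not ngram:
            continue  # skip blank lines in the n-grams file
        size = len(ngram)
        result = []
        i = 0
        while i < len(tokens):
            if tokens[i:i + size] == ngram:
                result.extend(replacement_tokens)
                i += size
            else:
                result.append(tokens[i])
                i += 1
        tokens = result
    return "".join(tokens)


print(excise("abcdab", ["ab", "abc"], "X"))  # XdX
```

If the shorter n-gram "ab" were applied first, the text would become "XcdX" and "abc" would no longer match, which is why the order of replacement matters.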