tacl excise

usage: tacl excise [-h] [-v] [-t {cbeta,latin,pagel}]
                   NGRAMS REPLACEMENT OUTPUT CORPUS WORK [WORK ...]

Output witness files for each specified work with all of the specified n-grams
replaced with the supplied replacement text. The replacement is done for each
n-gram in turn, in descending order of n-gram length.

positional arguments:
  NGRAMS                Path to file containing n-grams (one per line) to be
                        replaced.
  REPLACEMENT           Text to replace n-grams with. This should be one or
                        more valid tokens.
  OUTPUT                Path to directory to output transformed files to.
  CORPUS                Path to corpus.
  WORK                  Work whose witnesses will be transformed.

options:
  -h, --help            show this help message and exit
  -v, --verbose         Display debug information; multiple -v options
                        increase the verbosity.
  -t {cbeta,latin,pagel}, --tokenizer {cbeta,latin,pagel}
                        Type of tokenizer to use. The "cbeta" tokenizer is
                        suitable for the Chinese CBETA corpus (tokens are
                        single characters or workaround clusters within square
                        brackets). The "pagel" tokenizer is for use with the
                        transliterated Tibetan corpus (tokens are sets of word
                        characters plus some punctuation used to transliterate
                        characters).
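
The description above says replacement is done for each n-gram in turn, in descending order of n-gram length. The following is a minimal sketch of why that ordering matters. It is not TACL's implementation: the function name `excise` and the sample data are hypothetical, and it assumes each token is a single character, so string length stands in for n-gram length.

```python
def excise(text, ngrams, replacement):
    """Replace every n-gram in text, processing the longest n-grams first.

    Illustrative only: assumes one character per token, so len() of the
    string is used as the n-gram length.
    """
    # Longest first, so that a long n-gram is replaced before any shorter
    # n-gram it contains can break it apart.
    for ngram in sorted(set(ngrams), key=len, reverse=True):
        if ngram:  # skip blank lines from the n-grams file
            text = text.replace(ngram, replacement)
    return text


# Hypothetical sample data: "ab" is contained in "abcd".
result = excise("xabcdyabz", ["ab", "abcd"], "_")
print(result)  # x_y_z
```

If the shorter n-gram were replaced first, the text would become `x_cdy_z` and `abcd` would no longer match, leaving `cd` behind. Processing in descending order of length avoids that.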