usage: tacl search [-h] [-v] [-m] [-r RAM] [-t {cbeta,latin,pagel}]
                   DATABASE CORPUS CATALOGUE [NGRAMS ...]

Output results of searching the database for the supplied n-grams that occur
within labelled witnesses.

positional arguments:
  DATABASE              Path to database file.
  CORPUS                Path to corpus.
  CATALOGUE             Path to catalogue file.
  NGRAMS                Path to file containing list of n-grams to search for,
                        with one n-gram per line. (default: None)

options:
  -h, --help            show this help message and exit
  -v, --verbose         Display debug information; multiple -v options
                        increase the verbosity. (default: None)
  -m, --memory          Use RAM for temporary database storage.
                        This may cause an out of memory error, in which case
                        run the command without this switch. (default: False)
  -r RAM, --ram RAM     Number of gigabytes of RAM to use. (default: 3)
  -t {cbeta,latin,pagel}, --tokenizer {cbeta,latin,pagel}
                        Type of tokenizer to use. The "cbeta" tokenizer is
                        suitable for the Chinese CBETA corpus (tokens are
                        single characters or workaround clusters within square
                        brackets). The "pagel" tokenizer is for use with the
                        transliterated Tibetan corpus (tokens are sets of word
                        characters plus some punctuation used to transliterate
                        characters). (default: cbeta)

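The "cbeta" token description above can be illustrated with a small regular expression. This is only a sketch: it assumes that a workaround cluster is any run of characters between "[" and "]" and that every other token is a single word character. The pattern and the names `CBETA_LIKE`, `tokenize_cbeta_like` and `ngrams` are illustrative and are not taken from tacl itself.

```python
import re

# Assumed pattern (not tacl's own): a bracketed cluster such as "[口*兮]",
# or otherwise exactly one word character.
CBETA_LIKE = re.compile(r"\[[^\]]*\]|\w")


def tokenize_cbeta_like(text):
    """Split text into single characters and bracketed clusters."""
    return CBETA_LIKE.findall(text)


def ngrams(tokens, n):
    """Join each run of n consecutive tokens into one n-gram string."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize_cbeta_like("如是[口*兮]我聞")
print(tokens)
print(ngrams(tokens, 2))
```

Under these assumptions a bracketed cluster counts as one token, so a 2-gram can be longer than two characters when written out.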
If multiple paths to files containing n-grams are given, the combined set of
n-grams from all files will be searched for.

If no path is given, the results will include all n-grams found for all of the
labelled witnesses in the catalogue.

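The "combined set" behaviour for multiple NGRAMS files can be pictured as a set union over the lines of each file. The sketch below is not tacl's implementation; the function name `load_ngrams` is hypothetical, and it assumes UTF-8 files in which blank lines are ignored.

```python
from pathlib import Path


def load_ngrams(paths):
    """Return the union of the n-grams listed one per line in each file."""
    combined = set()
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        for line in text.splitlines():
            ngram = line.strip()
            if ngram:  # assumption: blank lines carry no n-gram
                combined.add(ngram)
    return combined
```

An n-gram that appears in more than one file is kept once. In this sketch an empty list of paths yields an empty set, whereas the command itself treats "no path" as "all n-grams", as described above.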
Due to encoding issues, you may need to set the environment variable
PYTHONIOENCODING to "utf-8".
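
One way to apply the PYTHONIOENCODING advice when driving the command from a script is to set the variable only for the child process. The following is a minimal sketch: the helper names are hypothetical, it uses only the flags listed in the help text above, and it assumes `tacl` is on the PATH.

```python
import os
import subprocess


def build_search_command(database, corpus, catalogue, ngram_files=(),
                         tokenizer="cbeta", memory=False, ram=None,
                         verbosity=0):
    """Assemble the argument list for `tacl search` without running it."""
    cmd = ["tacl", "search"]
    cmd += ["-v"] * verbosity          # repeated -v raises verbosity
    if memory:
        cmd.append("-m")
    if ram is not None:
        cmd += ["-r", str(ram)]
    cmd += ["-t", tokenizer]
    cmd += [database, corpus, catalogue, *ngram_files]
    return cmd


def search_environment(base_env=None):
    """Copy the environment and force UTF-8 for the child's standard streams."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYTHONIOENCODING"] = "utf-8"
    return env


def run_search(*args, **kwargs):
    """Run the search and return its standard output as text."""
    completed = subprocess.run(
        build_search_command(*args, **kwargs),
        env=search_environment(),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return completed.stdout
```

For example, `run_search("tacl.db", "corpus/", "catalogue.txt", ["ngrams.txt"])` would run the search with the default tokenizer; the file names here are placeholders.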